A Unified Zeroth-Order Approach for Decentralized Minimax Optimization

Aleksandar Armacki; Ali H. Sayed; Haoyuan Cai; Jie Chen; Yike Zhao

arxiv: 2606.12124 · v1 · pith:CHBAVRFCnew · submitted 2026-06-10 · 🧮 math.OC

A Unified Zeroth-Order Approach for Decentralized Minimax Optimization

Haoyuan Cai , Yike Zhao , Aleksandar Armacki , Jie Chen , Ali H. Sayed This is my paper

Pith reviewed 2026-06-27 08:45 UTC · model grok-4.3

classification 🧮 math.OC

keywords zeroth-order optimizationdecentralized optimizationminimax optimizationPolyak-Lojasiewicz conditiongradient trackingacceleration techniquesmulti-agent systemsgradient-free methods

0 comments

The pith

ZOMA unifies zeroth-order estimators, bias corrections, and accelerations for decentralized minimax optimization under the nonconvex Polyak-Lojasiewicz condition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ZOMA as a single framework that combines a hybrid zeroth-order estimator with multiple bias-correction strategies and acceleration techniques for multi-agent minimax problems where only function values are available. This unification produces several new decentralized algorithms and supplies convergence guarantees that match those of centralized zeroth-order minimax methods. A sympathetic reader would value the result because the approach yields linear speedup as the number of agents grows while remaining applicable to gradient-free settings. The framework further supplies a method to specialize rates to particular problem structures and algorithmic choices.

Core claim

The central claim is that a multi-level unified framework called ZOMA, incorporating a hybrid zeroth-order estimator that accommodates coordinate-wise and randomized uniform smoothing, subsumes bias-correction strategies including gradient tracking, exact diffusion, and EXTRA, and facilitates acceleration techniques including zeroth-order versions of STORM, PAGE, and L2S, enables novel decentralized zeroth-order minimax methods with unified convergence guarantees that match state-of-the-art centralized zeroth-order minimax methods and provide linear speed-up in the number of users for nonconvex Polyak-Lojasiewicz minimax optimization.

What carries the argument

The ZOMA framework, which unifies a hybrid zeroth-order estimator with bias-correction strategies and acceleration techniques to derive and analyze decentralized minimax algorithms from function values alone.

If this is right

The framework generates many novel decentralized zeroth-order minimax methods.
Convergence guarantees match those of state-of-the-art centralized zeroth-order minimax methods.
Linear speed-up occurs in the number of users.
The unified rates supply a systematic way to assess suitability for specific problem structures and method designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers could mix estimator types with different bias corrections to fit hardware constraints while retaining the same theoretical rates.
The same unification pattern could be tested on stochastic or constrained minimax problems beyond the current nonconvex PL setting.
Simulations could be extended to measure wall-clock time benefits in distributed hardware where communication costs dominate.

Load-bearing premise

The hybrid zeroth-order estimator can be paired with any of the listed bias-correction strategies and acceleration techniques while preserving the stated convergence rates under only the nonconvex Polyak-Lojasiewicz condition.

What would settle it

Numerical results on a nonconvex Polyak-Lojasiewicz minimax problem where a specific pairing of the hybrid estimator with EXTRA bias correction and STORM acceleration fails to achieve the claimed rate would falsify the unified guarantee.

Figures

Figures reproduced from arXiv: 2606.12124 by Aleksandar Armacki, Ali H. Sayed, Haoyuan Cai, Jie Chen, Yike Zhao.

**Figure 2.** Figure 2: Comparison between coordinate-wise (CW) variants of [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison between hybrid variants of ZOMA and centralized method ZO-VRGDA [42] in the smooth case (Setting 3) with the dataset ijcnn1 (left) and a9a (right). VI. CONCLUSION In this work, we proposed the first unified ZO framework for decentralized minimax optimization. The proposed framework integrates various decentralized strategies and ZO acceleration techniques, achieving improved guarantees over exi… view at source ↗

**Figure 4.** Figure 4: Simulation results on training a fair classifier with the FashionMNIST [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

read the original abstract

We propose ZOMA, a unified Zeroth-Order decentralized accelerated MinimAx framework for multi-agent nonconvex Polyak--\L{}ojasiewicz minimax optimization. The proposed framework only requires evaluating the function value and, as such, is tailored to gradient-free environments, where exact gradient information is either unavailable or computationally prohibitive to obtain. A central contribution of our \textbf{ZOMA} framework is a multi-level unification, along the following directions: (i) \emph{estimator} - our framework adopts a hybrid zeroth-order estimator, which accommodates, among others, both coordinate-wise and randomized uniform smoothing estimators; (ii) \emph{bias correction} - our framework subsumes a wide range of bias-correction strategies, including gradient tracking (GT), exact diffusion (ED), and EXTRA and (iii) \emph{acceleration} - our framework facilitates a broad class of acceleration techniques, including zeroth-order versions of STORM, PAGE, and L2S. The general nature of \textbf{ZOMA} leads to many novel decentralized zeroth-order minimax methods and allows us to establish unified convergence guarantees, matching the performance of state-of-the-art centralized zeroth-order minimax methods, while providing benefits, such as linear speed-up in the number of users. The unified framework also provides a systematic way to assess algorithmic suitability by specializing the convergence rates to specific problem structures and method designs. We validate the performance of the proposed algorithms via numerical simulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ZOMA packages several zeroth-order choices into one decentralized minimax framework but the abstract leaves open whether every estimator-bias-acceleration combo really shares the same rates under only nonconvex-PL.

read the letter

The main takeaway is that the paper introduces ZOMA as a single analysis that covers a hybrid zeroth-order estimator (coordinate-wise or uniform smoothing), several bias corrections (GT, ED, EXTRA), and accelerations (STORM, PAGE, L2S) for multi-agent nonconvex minimax problems under the Polyak-Łojasiewicz condition. It claims this produces new methods, matches centralized zeroth-order rates, and delivers linear speedup in the number of agents.

What the work does reasonably is lay out a modular structure so that different combinations can be plugged in and their rates compared in one place. That kind of packaging can help researchers see which design choices matter for a given problem without starting from scratch each time. The mention of numerical simulations to check performance is also a standard and useful step.

The soft spot is the unification claim itself. The abstract states that the framework subsumes all those combinations while preserving the stated rates under only the nonconvex-PL assumption, yet it gives no derivation outline or explicit conditions on how the bias and variance terms stay controlled when the estimator is paired with every listed correction and accelerator. If some pairings force extra Lipschitz factors, network-diameter terms, or different smoothing-parameter regimes, the rates would no longer be uniform and the “unified guarantees” would be weaker than presented. Without seeing the actual proofs it is impossible to tell whether the analysis handles this cleanly or whether the compatibility is assumed rather than shown.

This paper is mainly for people already working on distributed zeroth-order or minimax methods who want a reference point for combining existing pieces. A reader looking for a systematic design space could get something out of the framework idea, but anyone needing the rates to be taken at face value would have to wait for the full derivations.

I would send it to peer review so the proofs and the uniformity of the rates can be checked directly.

Referee Report

1 major / 1 minor

Summary. The paper proposes ZOMA, a unified Zeroth-Order decentralized accelerated MinimAx framework for multi-agent nonconvex Polyak-Łojasiewicz minimax optimization. It unifies a hybrid zeroth-order estimator (accommodating coordinate-wise and randomized uniform smoothing), bias-correction strategies (GT, ED, EXTRA), and acceleration techniques (zeroth-order STORM, PAGE, L2S). The framework establishes unified convergence guarantees matching state-of-the-art centralized zeroth-order minimax methods, with linear speedup in the number of users, and is validated via numerical simulations.

Significance. If the central unification holds with uniform rates across combinations under the stated assumptions, the work would provide a systematic framework for designing and analyzing decentralized zeroth-order minimax algorithms, enabling assessment of suitability for specific problem structures. The explicit multi-level unification and matching of centralized performance with decentralized linear speedup would be a notable contribution in gradient-free multi-agent optimization.

major comments (1)

[Abstract and §1] Abstract and §1 (Introduction): The claim that the hybrid zeroth-order estimator can be paired with any listed bias-correction (GT/ED/EXTRA) and acceleration (STORM/PAGE/L2S) while preserving the stated convergence rates under only the nonconvex Polyak-Łojasiewicz condition is load-bearing for the unification contribution. The analysis must explicitly verify that bias/variance bounds remain uniform without introducing case-specific network-diameter factors or extra smoothness requirements; otherwise the rates are not truly unified.

minor comments (1)

[Numerical simulations] The numerical simulations section could specify which estimator-bias-acceleration combinations were tested to better illustrate the unification.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the load-bearing aspect of the unification claim. We address the major comment below and are prepared to strengthen the presentation of uniformity if needed.

read point-by-point responses

Referee: [Abstract and §1] Abstract and §1 (Introduction): The claim that the hybrid zeroth-order estimator can be paired with any listed bias-correction (GT/ED/EXTRA) and acceleration (STORM/PAGE/L2S) while preserving the stated convergence rates under only the nonconvex Polyak-Łojasiewicz condition is load-bearing for the unification contribution. The analysis must explicitly verify that bias/variance bounds remain uniform without introducing case-specific network-diameter factors or extra smoothness requirements; otherwise the rates are not truly unified.

Authors: The unified analysis (Section 4 and Appendix B) derives bias and variance bounds for the hybrid zeroth-order estimator in a general form that depends only on the smoothing radius, the PL constant, and the standard assumptions on the bias-correction operators. These bounds are shown to hold uniformly across all listed combinations of GT/ED/EXTRA with STORM/PAGE/L2S; no additional network-diameter factors appear beyond those already embedded in the bias-correction analysis itself, and no extra smoothness is imposed. The resulting rates therefore match the centralized zeroth-order minimax rates uniformly while retaining the linear speedup. We can add an explicit remark or short subsection in Section 3 that restates this uniformity for clarity. revision: partial

Circularity Check

0 steps flagged

No circularity: ZOMA unification is definitional by framework construction; convergence rates presented as derived, not reduced to self-citations or fits.

full rationale

The provided abstract defines ZOMA explicitly to subsume hybrid estimators, GT/ED/EXTRA bias corrections, and STORM/PAGE/L2S accelerations, then states that this leads to unified convergence guarantees matching centralized ZO methods. No equations appear that would make any claimed rate equivalent to its inputs by construction, no self-citations are referenced as load-bearing, and no fitted parameters are renamed as predictions. The derivation chain is therefore self-contained; the unification is an explicit design choice whose correctness is to be verified externally rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or additional axioms are visible beyond the problem class stated in the abstract.

axioms (1)

domain assumption The target problem satisfies the nonconvex Polyak-Łojasiewicz condition for minimax optimization.
Explicitly named in the abstract as the setting in which ZOMA operates.

pith-pipeline@v0.9.1-grok · 5806 in / 1344 out tokens · 17072 ms · 2026-06-27T08:45:23.881138+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 3 linked inside Pith

[1]

Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,

P.-Y . Chen, H. Zhang, Y . Sharma, J. Yi, and C.-J. Hsieh, “Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” inACM Workshop on Artificial Intelligence and Security, 2017, pp. 15–26

2017
[2]

Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks,

C.-C. Tu, P. Ting, P.-Y . Chen, S. Liu, H. Zhang, J. Yi, C.-J. Hsieh, and S.-M. Cheng, “Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks,” inAAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 742–749

2019
[3]

Fine-tuning language models with just forward passes,

S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora, “Fine-tuning language models with just forward passes,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 53 038–53 075. 10 Fig. 4. Simulation results on training a fair classifier with the FashionMNIST dataset. We report the test accuracy of thehybridZO va...

2023
[4]

Revisiting zeroth- order optimization for memory-efficient llm fine-tuning: A benchmark,

Y . Zhang, P. Li, J. Hong, J. Li, Y . Zhang, W. Zheng, P.-Y . Chen, J. D. Lee, W. Yin, M. Hong, Z. Wang, S. Liu, and T. Chen, “Revisiting zeroth- order optimization for memory-efficient llm fine-tuning: A benchmark,” arXiv:2402.11592, 2024

arXiv 2024
[5]

A decentralized parallel algorithm for training generative adversarial nets,

M. Liu, W. Zhang, Y . Mroueh, X. Cui, J. Ross, T. Yang, and P. Das, “A decentralized parallel algorithm for training generative adversarial nets,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 11 056–11 070

2020
[6]

Communication-efficient distributed stochastic auc maximization with deep neural networks,

Z. Guo, M. Liu, Z. Yuan, L. Shen, W. Liu, and T. Yang, “Communication-efficient distributed stochastic auc maximization with deep neural networks,” inInternational Conference on Machine Learn- ing, 2020, pp. 3864–3874

2020
[7]

Robust multi- agent reinforcement learning via minimax deep deterministic policy gradient,

S. Li, Y . Wu, X. Cui, H. Dong, F. Fang, and S. Russell, “Robust multi- agent reinforcement learning via minimax deep deterministic policy gradient,” inAAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 4213–4220

2019
[8]

Improved zeroth-order vari- ance reduced algorithms and analysis for nonconvex optimization,

K. Ji, Z. Wang, Y . Zhou, and Y . Liang, “Improved zeroth-order vari- ance reduced algorithms and analysis for nonconvex optimization,” in International Conference on Machine Learning, 2019, pp. 3100–3109

2019
[9]

Min-max optimization without gradients: Convergence and applications to black-box evasion and poisoning attacks,

S. Liu, S. Lu, X. Chen, Y . Feng, K. Xu, A. Al-Dujaili, M. Hong, and U.-M. O’Reilly, “Min-max optimization without gradients: Convergence and applications to black-box evasion and poisoning attacks,” inInter- national Conference on Machine Learning, 2020, pp. 6282–6293

2020
[10]

Zeroth-order stochastic variance reduction for nonconvex optimization,

S. Liu, B. Kailkhura, P.-Y . Chen, P. Ting, S. Chang, and L. Amini, “Zeroth-order stochastic variance reduction for nonconvex optimization,” inAdvances in Neural Information Processing Systems, vol. 31, 2018, pp. 3731–3741

2018
[11]

On gradient descent ascent for nonconvex-concave minimax problems,

T. Lin, C. Jin, and M. I. Jordan, “On gradient descent ascent for nonconvex-concave minimax problems,” inInternational Conference on Machine Learning, 2020, pp. 6083–6093

2020
[12]

Efficient methods for structured nonconvex-nonconcave min-max optimization,

J. Diakonikolas, C. Daskalakis, and M. I. Jordan, “Efficient methods for structured nonconvex-nonconcave min-max optimization,” inInter- national Conference on Artificial Intelligence and Statistics, 2021, pp. 2746–2754

2021
[13]

A unified single-loop alternat- ing gradient projection algorithm for nonconvex–concave and convex– nonconcave minimax problems,

Z. Xu, H. Zhang, Y . Xu, and G. Lan, “A unified single-loop alternat- ing gradient projection algorithm for nonconvex–concave and convex– nonconcave minimax problems,”Mathematical Programming, vol. 201, no. 1, pp. 635–706, 2023

2023
[14]

Faster single-loop algo- rithms for minimax optimization without strong concavity,

J. Yang, A. Orvieto, A. Lucchi, and N. He, “Faster single-loop algo- rithms for minimax optimization without strong concavity,” inInter- national Conference on Artificial Intelligence and Statistics, 2022, pp. 5485–5517

2022
[15]

Universal gradient descent ascent method for nonconvex-nonconcave minimax optimization,

T. Zheng, L. Zhu, A. M.-C. So, J. Blanchet, and J. Li, “Universal gradient descent ascent method for nonconvex-nonconcave minimax optimization,” inAdvances in Neural Information Processing Systems, vol. 36, 2023, pp. 54 075–54 110

2023
[16]

Accelerated stochas- tic min-max optimization based on bias-corrected momentum,

H. Cai, S. A. Alghunaim, and A. H. Sayed, “Accelerated stochas- tic min-max optimization based on bias-corrected momentum,” arXiv:2406.13041, 2024

Pith/arXiv arXiv 2024
[17]

Enhanced adaptive gradient algorithms for nonconvex-pl minimax optimization,

F. Huang, C. Xuan, X. Wang, S. Zhang, and S. Chen, “Enhanced adaptive gradient algorithms for nonconvex-pl minimax optimization,” in International Conference on Artificial Intelligence and Statistics, 2025, pp. 3439–3447

2025
[18]

Two-timescale gradient descent ascent algorithms for nonconvex minimax optimization,

T. Lin, C. Jin, and M. I. Jordan, “Two-timescale gradient descent ascent algorithms for nonconvex minimax optimization,”Journal of Machine Learning Research, vol. 26, no. 11, pp. 1–45, 2025

2025
[19]

Tiada: A time-scale adaptive algorithm for nonconvex minimax optimization,

X. Li, J. Yang, and N. He, “Tiada: A time-scale adaptive algorithm for nonconvex minimax optimization,”arXiv:2210.17478, 2022

arXiv 2022
[20]

A faster decentralized algorithm for nonconvex minimax problems,

W. Xian, F. Huang, Y . Zhang, and H. Huang, “A faster decentralized algorithm for nonconvex minimax problems,” inAdvances in Neural Information Processing Systems, vol. 34, 2021, pp. 25 865–25 877

2021
[21]

Decentralized stochastic gradient descent ascent for finite-sum minimax problems,

H. Gao, “Decentralized stochastic gradient descent ascent for finite-sum minimax problems,”arXiv:2212.02724, 2022

arXiv 2022
[22]

Variance-reduced accelerated methods for decentralized stochastic double-regularized nonconvex strongly-concave minimax problems,

G. Mancino-Ball and Y . Xu, “Variance-reduced accelerated methods for decentralized stochastic double-regularized nonconvex strongly-concave minimax problems,”arXiv:2307.07113, 2023

arXiv 2023
[23]

Jointly improving the sample and communication complexities in decentralized stochastic minimax optimization,

X. Zhang, G. Mancino-Ball, N. S. Aybat, and Y . Xu, “Jointly improving the sample and communication complexities in decentralized stochastic minimax optimization,” inAAAI Conference on Artificial Intelligence, vol. 38, no. 18, 2024, pp. 20 865–20 873

2024
[24]

Decentralized gradient descent maximization method for com- posite nonconvex strongly-concave minimax problems,

Y . Xu, “Decentralized gradient descent maximization method for com- posite nonconvex strongly-concave minimax problems,”SIAM Journal on Optimization, vol. 34, no. 1, pp. 1006–1044, 2024

2024
[25]

Communication-efficient algorithms for distributed nonconvex minimax optimization problems,

H. Cai, S. A. Alghunaim, and A. H. Sayed, “Communication-efficient algorithms for distributed nonconvex minimax optimization problems,” arXiv:2507.21901, 2025

Pith/arXiv arXiv 2025
[26]

DAMA: A unified accelerated approach for decentralized non- convex minimax optimization-part I: Algorithm development and re- sults,

——, “DAMA: A unified accelerated approach for decentralized non- convex minimax optimization-part I: Algorithm development and re- sults,”arXiv:2512.13920, 2025

arXiv 2025
[27]

DAMA: A unified accelerated approach for decentralized non- convex minimax optimization-part II: Convergence and performance analyses,

——, “DAMA: A unified accelerated approach for decentralized non- convex minimax optimization-part II: Convergence and performance analyses,”arXiv:2512.13923, 2025

arXiv 2025
[28]

Stochastic first-and zeroth-order methods for nonconvex stochastic programming,

S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,”SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013

2013
[29]

Random gradient-free minimization of convex functions,

Y . Nesterov and V . Spokoiny, “Random gradient-free minimization of convex functions,”Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, 2017

2017
[30]

On the information-adaptive variants of admm: An iteration complexity perspective,

X. Gao, B. Jiang, and S. Zhang, “On the information-adaptive variants of admm: An iteration complexity perspective,”Journal of Scientific Computing, vol. 76, no. 1, pp. 327–363, 2018

2018
[31]

Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications,

S. Liu, J. Chen, P.-Y . Chen, and A. Hero, “Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications,” inInternational Conference on Artificial Intelligence and Statistics, 2018, pp. 288–297

2018
[32]

A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications,

S. Liu, P.-Y . Chen, B. Kailkhura, G. Zhang, A. O. Hero, and P. K. Varshney, “A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications,” IEEE Signal Processing Magazine, vol. 37, no. 5, pp. 43–54, 2020

2020
[33]

A zeroth-order variance-reduced method for decentralized stochastic non-convex optimization,

H. Chen, J. Chen, and K. Wei, “A zeroth-order variance-reduced method for decentralized stochastic non-convex optimization,”Optimization, pp. 1–43, 2025

2025
[34]

Zeroth-order algorithms for stochastic distributed nonconvex optimization,

X. Yi, S. Zhang, T. Yang, and K. H. Johansson, “Zeroth-order algorithms for stochastic distributed nonconvex optimization,”Automatica, vol. 142, p. 110353, 2022

2022
[35]

Distributed zero-order algorithms for nonconvex multiagent optimization,

Y . Tang, J. Zhang, and N. Li, “Distributed zero-order algorithms for nonconvex multiagent optimization,”IEEE Transactions on Control of Network Systems, vol. 8, no. 1, pp. 269–281, 2020

2020
[36]

Distributed zeroth or- der optimization over random networks: A Kiefer-Wolfowitz stochastic approximation approach,

A. K. Sahu, D. Jakovetic, D. Bajovic, and S. Kar, “Distributed zeroth or- der optimization over random networks: A Kiefer-Wolfowitz stochastic approximation approach,” inIEEE Conference on Decision and Control. IEEE, 2018, pp. 4951–4958

2018
[37]

Decentralized zeroth-order constrained stochas- tic optimization algorithms: Frank–Wolfe and variants with applications to black-box adversarial attacks,

A. K. Sahu and S. Kar, “Decentralized zeroth-order constrained stochas- tic optimization algorithms: Frank–Wolfe and variants with applications to black-box adversarial attacks,”Proceedings of the IEEE, vol. 108, no. 11, pp. 1890–1905, 2020

1905
[38]

A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems,

J. Zhang, P. Xiao, R. Sun, and Z. Luo, “A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 7377–7389

2020
[39]

Stochastic recursive gradi- ent descent ascent for stochastic nonconvex-strongly-concave minimax problems,

L. Luo, H. Ye, Z. Huang, and T. Zhang, “Stochastic recursive gradi- ent descent ascent for stochastic nonconvex-strongly-concave minimax problems,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 20 566–20 577. 11

2020
[40]

Zeroth- order algorithms for nonconvex–strongly-concave minimax problems with improved complexities,

Z. Wang, K. Balasubramanian, S. Ma, and M. Razaviyayn, “Zeroth- order algorithms for nonconvex–strongly-concave minimax problems with improved complexities,”Journal of Global Optimization, vol. 87, no. 2, pp. 709–740, 2023

2023
[41]

Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization,

F. Huang, S. Gao, J. Pei, and H. Huang, “Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization,” Journal of Machine Learning Research, vol. 23, no. 36, pp. 1–70, 2022

2022
[42]

Gradient free minimax optimization: Variance reduction and faster convergence,

T. Xu, Z. Wang, Y . Liang, and H. V . Poor, “Gradient free minimax optimization: Variance reduction and faster convergence,” arXiv:2006.09361, 2020

arXiv 2006
[43]

Zeroth-order alternating gradient descent ascent algorithms for a class of nonconvex-nonconcave minimax problems,

Z. Xu, Z.-Q. Wang, J.-L. Wang, and Y .-H. Dai, “Zeroth-order alternating gradient descent ascent algorithms for a class of nonconvex-nonconcave minimax problems,”Journal of Machine Learning Research, vol. 24, no. 313, pp. 1–25, 2023

2023
[44]

Robust and faster zeroth-order minimax optimization: complexity and applications,

W. An, Y . Liu, F. Shang, and H. Liu, “Robust and faster zeroth-order minimax optimization: complexity and applications,” inAdvances in Neural Information Processing Systems, vol. 37, 2024, pp. 37 050– 37 069

2024
[45]

Near-optimal decentralized momentum method for nonconvex-pl minimax problems,

F. Huang and S. Chen, “Near-optimal decentralized momentum method for nonconvex-pl minimax problems,”arXiv:2304.10902, 2023

arXiv 2023
[46]

Federated stochastic minimax optimization under heavy-tailed noises,

X. Zhang and H. Gao, “Federated stochastic minimax optimization under heavy-tailed noises,”arXiv:2511.04456, 2025

arXiv 2025
[47]

A unified and refined convergence analysis for non-convex decentralized learning,

S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,”IEEE Transactions on Signal Processing, vol. 70, pp. 3264–3279, 2022

2022
[48]

A. H. Sayed,Inference and Learning from Data. Cambridge University Press, 2022

2022
[49]

An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization,

L. Chen, H. Ye, and L. Luo, “An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization,” in International Conference on Artificial Intelligence and Statistics, 2024, pp. 1990–1998

2024
[50]

Momentum-based variance reduction in non-convex sgd,

A. Cutkosky and F. Orabona, “Momentum-based variance reduction in non-convex sgd,” inAdvances in Neural Information Processing Systems, vol. 32, 2019, pp. 15 236 – 15 245

2019
[51]

Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization,

Z. Li, H. Bao, X. Zhang, and P. Richt ´arik, “Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization,” in International Conference on Machine Learning, 2021, pp. 6286–6295

2021
[52]

On the convergence of sarah and beyond,

B. Li, M. Ma, and G. B. Giannakis, “On the convergence of sarah and beyond,” inInternational Conference on Artificial Intelligence and Statistics, 2020, pp. 223–233

2020
[53]

Achieving geometric convergence for distributed optimization over time-varying graphs,

A. Nedi ´c, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,”SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017

2017
[54]

Exact diffusion for distributed optimization and learning—part I: Algorithm development,

K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning—part I: Algorithm development,” IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 708–723, 2018

2018
[55]

Extra: An exact first-order algorithm for decentralized consensus optimization,

W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact first-order algorithm for decentralized consensus optimization,”SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015

2015
[56]

Solving a class of non-convex min-max games using iterative first order methods,

M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn, “Solving a class of non-convex min-max games using iterative first order methods,” inAdvances in Neural Information Processing Systems, vol. 32, 2019, pp. 14 905–14 916

2019
[57]

Loss landscapes and optimization in over-parameterized non-linear systems and neural networks,

C. Liu, L. Zhu, and M. Belkin, “Loss landscapes and optimization in over-parameterized non-linear systems and neural networks,”Applied and Computational Harmonic Analysis, vol. 59, pp. 85–116, 2022

2022
[58]

Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems,

J. Yang, N. Kiyavash, and N. He, “Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1153–1165

2020
[59]

Agnostic federated learning,

M. Mohri, G. Sivek, and A. T. Suresh, “Agnostic federated learning,” in International Conference on Machine Learning, 2019, pp. 4615–4625

2019
[60]

Solving a class of non-convex minimax optimization in federated learning,

X. Wu, J. Sun, Z. Hu, A. Zhang, and H. Huang, “Solving a class of non-convex minimax optimization in federated learning,” inAdvances in Neural Information Processing Systems, vol. 36, 2024

2024
[61]

Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,

H. Xiao, K. Rasul, and R. V ollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv:1708.07747, 2017. 12 APPENDIXA NOTATION To facilitate analysis, we first introduce the concatenated variables, i.e., z≜[x;y],z≜[x;y]∈R d1+d2 .(18) The averaged variable and gradient estimator are defined as zc,i ≜[x c,i;y c,i]≜ 1...

Pith/arXiv arXiv 2017
[62]

For convenience, let R≜Q k(x+µ xu, y;ξ)−Q k(z;ξ)− ⟨∇ xQk(z;ξ), µ xu⟩,(144) whereu,ξare independent ofz

Note that E qx i (zk,i;ξ j k,i,u j k,i)−q x i (zk,i−1;ξ j k,i,u j k,i) 2 =E d1(Qk(xk,i +µ xuj k,i,y k,i;ξ j k,i)−Q k(zk,i;ξ j k,i)) µx uj k,i − d1(Qk(xk,i−1 +µ xuj k,i,y k,i−1;ξ j k,i)−Q k(zk,i−1;ξ j k,i)) µx uj k,i 2 (a) =d 2 1E Qk(xk,i +µ xuj k,i,y k,i;ξ j k,i)−Q k(zk,i;ξ j k,i)− ⟨∇ xQk(zk,i;ξ j k,i), µxuj k,i⟩ µx uj k,i − Qk(xk,i−1 +µ xuj k,i,y k,i−1;ξ...
[63]

(181) Proof.According to Lemma 1, given aL-smooth functionP(z), thenP µx(x)≜E u∼Unil(Bd1)P(x+µ xu)isL ′(≤L)-smooth, it follows that Pµx(xc,i+1)≤P µx(xc,i) +⟨∇P µx(xc,i),x c,i+1 −x c,i⟩+ Lη2 x 2 ∥gx c,i∥2 ≤P µx(xc,i)−η x⟨∇Pµx(xc,i),g x c,i⟩+ Lη2 x 2 ∥gx c,i∥2 ≤P µx(xc,i)− ηx 2 ∥∇Pµx(xc,i)∥2 − ηx 2 ∥gx c,i∥2 + ηx 2 ∥∇Pµx(xc,i)−g x c,i∥2 + Lη2 x 2 ∥gx c,i∥2....

[1] [1]

Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,

P.-Y . Chen, H. Zhang, Y . Sharma, J. Yi, and C.-J. Hsieh, “Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” inACM Workshop on Artificial Intelligence and Security, 2017, pp. 15–26

2017

[2] [2]

Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks,

C.-C. Tu, P. Ting, P.-Y . Chen, S. Liu, H. Zhang, J. Yi, C.-J. Hsieh, and S.-M. Cheng, “Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks,” inAAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 742–749

2019

[3] [3]

Fine-tuning language models with just forward passes,

S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora, “Fine-tuning language models with just forward passes,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 53 038–53 075. 10 Fig. 4. Simulation results on training a fair classifier with the FashionMNIST dataset. We report the test accuracy of thehybridZO va...

2023

[4] [4]

Revisiting zeroth- order optimization for memory-efficient llm fine-tuning: A benchmark,

Y . Zhang, P. Li, J. Hong, J. Li, Y . Zhang, W. Zheng, P.-Y . Chen, J. D. Lee, W. Yin, M. Hong, Z. Wang, S. Liu, and T. Chen, “Revisiting zeroth- order optimization for memory-efficient llm fine-tuning: A benchmark,” arXiv:2402.11592, 2024

arXiv 2024

[5] [5]

A decentralized parallel algorithm for training generative adversarial nets,

M. Liu, W. Zhang, Y . Mroueh, X. Cui, J. Ross, T. Yang, and P. Das, “A decentralized parallel algorithm for training generative adversarial nets,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 11 056–11 070

2020

[6] [6]

Communication-efficient distributed stochastic auc maximization with deep neural networks,

Z. Guo, M. Liu, Z. Yuan, L. Shen, W. Liu, and T. Yang, “Communication-efficient distributed stochastic auc maximization with deep neural networks,” inInternational Conference on Machine Learn- ing, 2020, pp. 3864–3874

2020

[7] [7]

Robust multi- agent reinforcement learning via minimax deep deterministic policy gradient,

S. Li, Y . Wu, X. Cui, H. Dong, F. Fang, and S. Russell, “Robust multi- agent reinforcement learning via minimax deep deterministic policy gradient,” inAAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 4213–4220

2019

[8] [8]

Improved zeroth-order vari- ance reduced algorithms and analysis for nonconvex optimization,

K. Ji, Z. Wang, Y . Zhou, and Y . Liang, “Improved zeroth-order vari- ance reduced algorithms and analysis for nonconvex optimization,” in International Conference on Machine Learning, 2019, pp. 3100–3109

2019

[9] [9]

Min-max optimization without gradients: Convergence and applications to black-box evasion and poisoning attacks,

S. Liu, S. Lu, X. Chen, Y . Feng, K. Xu, A. Al-Dujaili, M. Hong, and U.-M. O’Reilly, “Min-max optimization without gradients: Convergence and applications to black-box evasion and poisoning attacks,” inInter- national Conference on Machine Learning, 2020, pp. 6282–6293

2020

[10] [10]

Zeroth-order stochastic variance reduction for nonconvex optimization,

S. Liu, B. Kailkhura, P.-Y . Chen, P. Ting, S. Chang, and L. Amini, “Zeroth-order stochastic variance reduction for nonconvex optimization,” inAdvances in Neural Information Processing Systems, vol. 31, 2018, pp. 3731–3741

2018

[11] [11]

On gradient descent ascent for nonconvex-concave minimax problems,

T. Lin, C. Jin, and M. I. Jordan, “On gradient descent ascent for nonconvex-concave minimax problems,” inInternational Conference on Machine Learning, 2020, pp. 6083–6093

2020

[12] [12]

Efficient methods for structured nonconvex-nonconcave min-max optimization,

J. Diakonikolas, C. Daskalakis, and M. I. Jordan, “Efficient methods for structured nonconvex-nonconcave min-max optimization,” inInter- national Conference on Artificial Intelligence and Statistics, 2021, pp. 2746–2754

2021

[13] [13]

A unified single-loop alternat- ing gradient projection algorithm for nonconvex–concave and convex– nonconcave minimax problems,

Z. Xu, H. Zhang, Y . Xu, and G. Lan, “A unified single-loop alternat- ing gradient projection algorithm for nonconvex–concave and convex– nonconcave minimax problems,”Mathematical Programming, vol. 201, no. 1, pp. 635–706, 2023

2023

[14] [14]

Faster single-loop algo- rithms for minimax optimization without strong concavity,

J. Yang, A. Orvieto, A. Lucchi, and N. He, “Faster single-loop algo- rithms for minimax optimization without strong concavity,” inInter- national Conference on Artificial Intelligence and Statistics, 2022, pp. 5485–5517

2022

[15] [15]

Universal gradient descent ascent method for nonconvex-nonconcave minimax optimization,

T. Zheng, L. Zhu, A. M.-C. So, J. Blanchet, and J. Li, “Universal gradient descent ascent method for nonconvex-nonconcave minimax optimization,” inAdvances in Neural Information Processing Systems, vol. 36, 2023, pp. 54 075–54 110

2023

[16] [16]

Accelerated stochas- tic min-max optimization based on bias-corrected momentum,

H. Cai, S. A. Alghunaim, and A. H. Sayed, “Accelerated stochas- tic min-max optimization based on bias-corrected momentum,” arXiv:2406.13041, 2024

Pith/arXiv arXiv 2024

[17] [17]

Enhanced adaptive gradient algorithms for nonconvex-pl minimax optimization,

F. Huang, C. Xuan, X. Wang, S. Zhang, and S. Chen, “Enhanced adaptive gradient algorithms for nonconvex-pl minimax optimization,” in International Conference on Artificial Intelligence and Statistics, 2025, pp. 3439–3447

2025

[18] [18]

Two-timescale gradient descent ascent algorithms for nonconvex minimax optimization,

T. Lin, C. Jin, and M. I. Jordan, “Two-timescale gradient descent ascent algorithms for nonconvex minimax optimization,”Journal of Machine Learning Research, vol. 26, no. 11, pp. 1–45, 2025

2025

[19] [19]

Tiada: A time-scale adaptive algorithm for nonconvex minimax optimization,

X. Li, J. Yang, and N. He, “Tiada: A time-scale adaptive algorithm for nonconvex minimax optimization,”arXiv:2210.17478, 2022

arXiv 2022

[20] [20]

A faster decentralized algorithm for nonconvex minimax problems,

W. Xian, F. Huang, Y . Zhang, and H. Huang, “A faster decentralized algorithm for nonconvex minimax problems,” inAdvances in Neural Information Processing Systems, vol. 34, 2021, pp. 25 865–25 877

2021

[21] [21]

Decentralized stochastic gradient descent ascent for finite-sum minimax problems,

H. Gao, “Decentralized stochastic gradient descent ascent for finite-sum minimax problems,”arXiv:2212.02724, 2022

arXiv 2022

[22] [22]

Variance-reduced accelerated methods for decentralized stochastic double-regularized nonconvex strongly-concave minimax problems,

G. Mancino-Ball and Y . Xu, “Variance-reduced accelerated methods for decentralized stochastic double-regularized nonconvex strongly-concave minimax problems,”arXiv:2307.07113, 2023

arXiv 2023

[23] [23]

Jointly improving the sample and communication complexities in decentralized stochastic minimax optimization,

X. Zhang, G. Mancino-Ball, N. S. Aybat, and Y . Xu, “Jointly improving the sample and communication complexities in decentralized stochastic minimax optimization,” inAAAI Conference on Artificial Intelligence, vol. 38, no. 18, 2024, pp. 20 865–20 873

2024

[24] [24]

Decentralized gradient descent maximization method for com- posite nonconvex strongly-concave minimax problems,

Y . Xu, “Decentralized gradient descent maximization method for com- posite nonconvex strongly-concave minimax problems,”SIAM Journal on Optimization, vol. 34, no. 1, pp. 1006–1044, 2024

2024

[25] [25]

Communication-efficient algorithms for distributed nonconvex minimax optimization problems,

H. Cai, S. A. Alghunaim, and A. H. Sayed, “Communication-efficient algorithms for distributed nonconvex minimax optimization problems,” arXiv:2507.21901, 2025

Pith/arXiv arXiv 2025

[26] [26]

DAMA: A unified accelerated approach for decentralized non- convex minimax optimization-part I: Algorithm development and re- sults,

——, “DAMA: A unified accelerated approach for decentralized non- convex minimax optimization-part I: Algorithm development and re- sults,”arXiv:2512.13920, 2025

arXiv 2025

[27] [27]

DAMA: A unified accelerated approach for decentralized non- convex minimax optimization-part II: Convergence and performance analyses,

——, “DAMA: A unified accelerated approach for decentralized non- convex minimax optimization-part II: Convergence and performance analyses,”arXiv:2512.13923, 2025

arXiv 2025

[28] [28]

Stochastic first-and zeroth-order methods for nonconvex stochastic programming,

S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,”SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013

2013

[29] [29]

Random gradient-free minimization of convex functions,

Y . Nesterov and V . Spokoiny, “Random gradient-free minimization of convex functions,”Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, 2017

2017

[30] [30]

On the information-adaptive variants of admm: An iteration complexity perspective,

X. Gao, B. Jiang, and S. Zhang, “On the information-adaptive variants of admm: An iteration complexity perspective,”Journal of Scientific Computing, vol. 76, no. 1, pp. 327–363, 2018

2018

[31] [31]

Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications,

S. Liu, J. Chen, P.-Y . Chen, and A. Hero, “Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications,” inInternational Conference on Artificial Intelligence and Statistics, 2018, pp. 288–297

2018

[32] [32]

A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications,

S. Liu, P.-Y . Chen, B. Kailkhura, G. Zhang, A. O. Hero, and P. K. Varshney, “A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications,” IEEE Signal Processing Magazine, vol. 37, no. 5, pp. 43–54, 2020

2020

[33] [33]

A zeroth-order variance-reduced method for decentralized stochastic non-convex optimization,

H. Chen, J. Chen, and K. Wei, “A zeroth-order variance-reduced method for decentralized stochastic non-convex optimization,”Optimization, pp. 1–43, 2025

2025

[34] [34]

Zeroth-order algorithms for stochastic distributed nonconvex optimization,

X. Yi, S. Zhang, T. Yang, and K. H. Johansson, “Zeroth-order algorithms for stochastic distributed nonconvex optimization,”Automatica, vol. 142, p. 110353, 2022

2022

[35] [35]

Distributed zero-order algorithms for nonconvex multiagent optimization,

Y . Tang, J. Zhang, and N. Li, “Distributed zero-order algorithms for nonconvex multiagent optimization,”IEEE Transactions on Control of Network Systems, vol. 8, no. 1, pp. 269–281, 2020

2020

[36] [36]

Distributed zeroth or- der optimization over random networks: A Kiefer-Wolfowitz stochastic approximation approach,

A. K. Sahu, D. Jakovetic, D. Bajovic, and S. Kar, “Distributed zeroth or- der optimization over random networks: A Kiefer-Wolfowitz stochastic approximation approach,” inIEEE Conference on Decision and Control. IEEE, 2018, pp. 4951–4958

2018

[37] [37]

Decentralized zeroth-order constrained stochas- tic optimization algorithms: Frank–Wolfe and variants with applications to black-box adversarial attacks,

A. K. Sahu and S. Kar, “Decentralized zeroth-order constrained stochas- tic optimization algorithms: Frank–Wolfe and variants with applications to black-box adversarial attacks,”Proceedings of the IEEE, vol. 108, no. 11, pp. 1890–1905, 2020

1905

[38] [38]

A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems,

J. Zhang, P. Xiao, R. Sun, and Z. Luo, “A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 7377–7389

2020

[39] [39]

Stochastic recursive gradi- ent descent ascent for stochastic nonconvex-strongly-concave minimax problems,

L. Luo, H. Ye, Z. Huang, and T. Zhang, “Stochastic recursive gradi- ent descent ascent for stochastic nonconvex-strongly-concave minimax problems,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 20 566–20 577. 11

2020

[40] [40]

Zeroth- order algorithms for nonconvex–strongly-concave minimax problems with improved complexities,

Z. Wang, K. Balasubramanian, S. Ma, and M. Razaviyayn, “Zeroth- order algorithms for nonconvex–strongly-concave minimax problems with improved complexities,”Journal of Global Optimization, vol. 87, no. 2, pp. 709–740, 2023

2023

[41] [41]

Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization,

F. Huang, S. Gao, J. Pei, and H. Huang, “Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization,” Journal of Machine Learning Research, vol. 23, no. 36, pp. 1–70, 2022

2022

[42] [42]

Gradient free minimax optimization: Variance reduction and faster convergence,

T. Xu, Z. Wang, Y . Liang, and H. V . Poor, “Gradient free minimax optimization: Variance reduction and faster convergence,” arXiv:2006.09361, 2020

arXiv 2006

[43] [43]

Zeroth-order alternating gradient descent ascent algorithms for a class of nonconvex-nonconcave minimax problems,

Z. Xu, Z.-Q. Wang, J.-L. Wang, and Y .-H. Dai, “Zeroth-order alternating gradient descent ascent algorithms for a class of nonconvex-nonconcave minimax problems,”Journal of Machine Learning Research, vol. 24, no. 313, pp. 1–25, 2023

2023

[44] [44]

Robust and faster zeroth-order minimax optimization: complexity and applications,

W. An, Y . Liu, F. Shang, and H. Liu, “Robust and faster zeroth-order minimax optimization: complexity and applications,” inAdvances in Neural Information Processing Systems, vol. 37, 2024, pp. 37 050– 37 069

2024

[45] [45]

Near-optimal decentralized momentum method for nonconvex-pl minimax problems,

F. Huang and S. Chen, “Near-optimal decentralized momentum method for nonconvex-pl minimax problems,”arXiv:2304.10902, 2023

arXiv 2023

[46] [46]

Federated stochastic minimax optimization under heavy-tailed noises,

X. Zhang and H. Gao, “Federated stochastic minimax optimization under heavy-tailed noises,”arXiv:2511.04456, 2025

arXiv 2025

[47] [47]

A unified and refined convergence analysis for non-convex decentralized learning,

S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,”IEEE Transactions on Signal Processing, vol. 70, pp. 3264–3279, 2022

2022

[48] [48]

A. H. Sayed,Inference and Learning from Data. Cambridge University Press, 2022

2022

[49] [49]

An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization,

L. Chen, H. Ye, and L. Luo, “An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization,” in International Conference on Artificial Intelligence and Statistics, 2024, pp. 1990–1998

2024

[50] [50]

Momentum-based variance reduction in non-convex sgd,

A. Cutkosky and F. Orabona, “Momentum-based variance reduction in non-convex sgd,” inAdvances in Neural Information Processing Systems, vol. 32, 2019, pp. 15 236 – 15 245

2019

[51] [51]

Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization,

Z. Li, H. Bao, X. Zhang, and P. Richt ´arik, “Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization,” in International Conference on Machine Learning, 2021, pp. 6286–6295

2021

[52] [52]

On the convergence of sarah and beyond,

B. Li, M. Ma, and G. B. Giannakis, “On the convergence of sarah and beyond,” inInternational Conference on Artificial Intelligence and Statistics, 2020, pp. 223–233

2020

[53] [53]

Achieving geometric convergence for distributed optimization over time-varying graphs,

A. Nedi ´c, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,”SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017

2017

[54] [54]

Exact diffusion for distributed optimization and learning—part I: Algorithm development,

K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning—part I: Algorithm development,” IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 708–723, 2018

2018

[55] [55]

Extra: An exact first-order algorithm for decentralized consensus optimization,

W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact first-order algorithm for decentralized consensus optimization,”SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015

2015

[56] [56]

Solving a class of non-convex min-max games using iterative first order methods,

M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn, “Solving a class of non-convex min-max games using iterative first order methods,” inAdvances in Neural Information Processing Systems, vol. 32, 2019, pp. 14 905–14 916

2019

[57] [57]

Loss landscapes and optimization in over-parameterized non-linear systems and neural networks,

C. Liu, L. Zhu, and M. Belkin, “Loss landscapes and optimization in over-parameterized non-linear systems and neural networks,”Applied and Computational Harmonic Analysis, vol. 59, pp. 85–116, 2022

2022

[58] [58]

Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems,

J. Yang, N. Kiyavash, and N. He, “Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1153–1165

2020

[59] [59]

Agnostic federated learning,

M. Mohri, G. Sivek, and A. T. Suresh, “Agnostic federated learning,” in International Conference on Machine Learning, 2019, pp. 4615–4625

2019

[60] [60]

Solving a class of non-convex minimax optimization in federated learning,

X. Wu, J. Sun, Z. Hu, A. Zhang, and H. Huang, “Solving a class of non-convex minimax optimization in federated learning,” inAdvances in Neural Information Processing Systems, vol. 36, 2024

2024

[61] [61]

Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,

H. Xiao, K. Rasul, and R. V ollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv:1708.07747, 2017. 12 APPENDIXA NOTATION To facilitate analysis, we first introduce the concatenated variables, i.e., z≜[x;y],z≜[x;y]∈R d1+d2 .(18) The averaged variable and gradient estimator are defined as zc,i ≜[x c,i;y c,i]≜ 1...

Pith/arXiv arXiv 2017

[62] [62]

For convenience, let R≜Q k(x+µ xu, y;ξ)−Q k(z;ξ)− ⟨∇ xQk(z;ξ), µ xu⟩,(144) whereu,ξare independent ofz

Note that E qx i (zk,i;ξ j k,i,u j k,i)−q x i (zk,i−1;ξ j k,i,u j k,i) 2 =E d1(Qk(xk,i +µ xuj k,i,y k,i;ξ j k,i)−Q k(zk,i;ξ j k,i)) µx uj k,i − d1(Qk(xk,i−1 +µ xuj k,i,y k,i−1;ξ j k,i)−Q k(zk,i−1;ξ j k,i)) µx uj k,i 2 (a) =d 2 1E Qk(xk,i +µ xuj k,i,y k,i;ξ j k,i)−Q k(zk,i;ξ j k,i)− ⟨∇ xQk(zk,i;ξ j k,i), µxuj k,i⟩ µx uj k,i − Qk(xk,i−1 +µ xuj k,i,y k,i−1;ξ...

[63] [63]

(181) Proof.According to Lemma 1, given aL-smooth functionP(z), thenP µx(x)≜E u∼Unil(Bd1)P(x+µ xu)isL ′(≤L)-smooth, it follows that Pµx(xc,i+1)≤P µx(xc,i) +⟨∇P µx(xc,i),x c,i+1 −x c,i⟩+ Lη2 x 2 ∥gx c,i∥2 ≤P µx(xc,i)−η x⟨∇Pµx(xc,i),g x c,i⟩+ Lη2 x 2 ∥gx c,i∥2 ≤P µx(xc,i)− ηx 2 ∥∇Pµx(xc,i)∥2 − ηx 2 ∥gx c,i∥2 + ηx 2 ∥∇Pµx(xc,i)−g x c,i∥2 + Lη2 x 2 ∥gx c,i∥2....