pith. sign in

arxiv: 2606.12124 · v1 · pith:CHBAVRFCnew · submitted 2026-06-10 · 🧮 math.OC

A Unified Zeroth-Order Approach for Decentralized Minimax Optimization

Pith reviewed 2026-06-27 08:45 UTC · model grok-4.3

classification 🧮 math.OC
keywords zeroth-order optimizationdecentralized optimizationminimax optimizationPolyak-Lojasiewicz conditiongradient trackingacceleration techniquesmulti-agent systemsgradient-free methods
0
0 comments X

The pith

ZOMA unifies zeroth-order estimators, bias corrections, and accelerations for decentralized minimax optimization under the nonconvex Polyak-Lojasiewicz condition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ZOMA as a single framework that combines a hybrid zeroth-order estimator with multiple bias-correction strategies and acceleration techniques for multi-agent minimax problems where only function values are available. This unification produces several new decentralized algorithms and supplies convergence guarantees that match those of centralized zeroth-order minimax methods. A sympathetic reader would value the result because the approach yields linear speedup as the number of agents grows while remaining applicable to gradient-free settings. The framework further supplies a method to specialize rates to particular problem structures and algorithmic choices.

Core claim

The central claim is that a multi-level unified framework called ZOMA, incorporating a hybrid zeroth-order estimator that accommodates coordinate-wise and randomized uniform smoothing, subsumes bias-correction strategies including gradient tracking, exact diffusion, and EXTRA, and facilitates acceleration techniques including zeroth-order versions of STORM, PAGE, and L2S, enables novel decentralized zeroth-order minimax methods with unified convergence guarantees that match state-of-the-art centralized zeroth-order minimax methods and provide linear speed-up in the number of users for nonconvex Polyak-Lojasiewicz minimax optimization.

What carries the argument

The ZOMA framework, which unifies a hybrid zeroth-order estimator with bias-correction strategies and acceleration techniques to derive and analyze decentralized minimax algorithms from function values alone.

If this is right

  • The framework generates many novel decentralized zeroth-order minimax methods.
  • Convergence guarantees match those of state-of-the-art centralized zeroth-order minimax methods.
  • Linear speed-up occurs in the number of users.
  • The unified rates supply a systematic way to assess suitability for specific problem structures and method designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers could mix estimator types with different bias corrections to fit hardware constraints while retaining the same theoretical rates.
  • The same unification pattern could be tested on stochastic or constrained minimax problems beyond the current nonconvex PL setting.
  • Simulations could be extended to measure wall-clock time benefits in distributed hardware where communication costs dominate.

Load-bearing premise

The hybrid zeroth-order estimator can be paired with any of the listed bias-correction strategies and acceleration techniques while preserving the stated convergence rates under only the nonconvex Polyak-Lojasiewicz condition.

What would settle it

Numerical results on a nonconvex Polyak-Lojasiewicz minimax problem where a specific pairing of the hybrid estimator with EXTRA bias correction and STORM acceleration fails to achieve the claimed rate would falsify the unified guarantee.

Figures

Figures reproduced from arXiv: 2606.12124 by Aleksandar Armacki, Ali H. Sayed, Haoyuan Cai, Jie Chen, Yike Zhao.

Figure 1
Figure 1. Figure 1: Simulation result of ZOMA in the nonsmooth case (Setting 1) with the dataset ijcnn1. The figures show test accuracy versus communication rounds and ZO oracle calls. In the right figure, the oracle-call axis is truncated once one method reaches its best observed performance [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison between coordinate-wise (CW) variants of [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison between hybrid variants of ZOMA and centralized method ZO-VRGDA [42] in the smooth case (Setting 3) with the dataset ijcnn1 (left) and a9a (right). VI. CONCLUSION In this work, we proposed the first unified ZO framework for decentralized minimax optimization. The proposed framework integrates various decentralized strategies and ZO accelera￾tion techniques, achieving improved guarantees over exi… view at source ↗
Figure 4
Figure 4. Figure 4: Simulation results on training a fair classifier with the FashionMNIST [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
read the original abstract

We propose ZOMA, a unified Zeroth-Order decentralized accelerated MinimAx framework for multi-agent nonconvex Polyak--\L{}ojasiewicz minimax optimization. The proposed framework only requires evaluating the function value and, as such, is tailored to gradient-free environments, where exact gradient information is either unavailable or computationally prohibitive to obtain. A central contribution of our \textbf{ZOMA} framework is a multi-level unification, along the following directions: (i) \emph{estimator} - our framework adopts a hybrid zeroth-order estimator, which accommodates, among others, both coordinate-wise and randomized uniform smoothing estimators; (ii) \emph{bias correction} - our framework subsumes a wide range of bias-correction strategies, including gradient tracking (GT), exact diffusion (ED), and EXTRA and (iii) \emph{acceleration} - our framework facilitates a broad class of acceleration techniques, including zeroth-order versions of STORM, PAGE, and L2S. The general nature of \textbf{ZOMA} leads to many novel decentralized zeroth-order minimax methods and allows us to establish unified convergence guarantees, matching the performance of state-of-the-art centralized zeroth-order minimax methods, while providing benefits, such as linear speed-up in the number of users. The unified framework also provides a systematic way to assess algorithmic suitability by specializing the convergence rates to specific problem structures and method designs. We validate the performance of the proposed algorithms via numerical simulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes ZOMA, a unified Zeroth-Order decentralized accelerated MinimAx framework for multi-agent nonconvex Polyak-Łojasiewicz minimax optimization. It unifies a hybrid zeroth-order estimator (accommodating coordinate-wise and randomized uniform smoothing), bias-correction strategies (GT, ED, EXTRA), and acceleration techniques (zeroth-order STORM, PAGE, L2S). The framework establishes unified convergence guarantees matching state-of-the-art centralized zeroth-order minimax methods, with linear speedup in the number of users, and is validated via numerical simulations.

Significance. If the central unification holds with uniform rates across combinations under the stated assumptions, the work would provide a systematic framework for designing and analyzing decentralized zeroth-order minimax algorithms, enabling assessment of suitability for specific problem structures. The explicit multi-level unification and matching of centralized performance with decentralized linear speedup would be a notable contribution in gradient-free multi-agent optimization.

major comments (1)
  1. [Abstract and §1] Abstract and §1 (Introduction): The claim that the hybrid zeroth-order estimator can be paired with any listed bias-correction (GT/ED/EXTRA) and acceleration (STORM/PAGE/L2S) while preserving the stated convergence rates under only the nonconvex Polyak-Łojasiewicz condition is load-bearing for the unification contribution. The analysis must explicitly verify that bias/variance bounds remain uniform without introducing case-specific network-diameter factors or extra smoothness requirements; otherwise the rates are not truly unified.
minor comments (1)
  1. [Numerical simulations] The numerical simulations section could specify which estimator-bias-acceleration combinations were tested to better illustrate the unification.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the load-bearing aspect of the unification claim. We address the major comment below and are prepared to strengthen the presentation of uniformity if needed.

read point-by-point responses
  1. Referee: [Abstract and §1] Abstract and §1 (Introduction): The claim that the hybrid zeroth-order estimator can be paired with any listed bias-correction (GT/ED/EXTRA) and acceleration (STORM/PAGE/L2S) while preserving the stated convergence rates under only the nonconvex Polyak-Łojasiewicz condition is load-bearing for the unification contribution. The analysis must explicitly verify that bias/variance bounds remain uniform without introducing case-specific network-diameter factors or extra smoothness requirements; otherwise the rates are not truly unified.

    Authors: The unified analysis (Section 4 and Appendix B) derives bias and variance bounds for the hybrid zeroth-order estimator in a general form that depends only on the smoothing radius, the PL constant, and the standard assumptions on the bias-correction operators. These bounds are shown to hold uniformly across all listed combinations of GT/ED/EXTRA with STORM/PAGE/L2S; no additional network-diameter factors appear beyond those already embedded in the bias-correction analysis itself, and no extra smoothness is imposed. The resulting rates therefore match the centralized zeroth-order minimax rates uniformly while retaining the linear speedup. We can add an explicit remark or short subsection in Section 3 that restates this uniformity for clarity. revision: partial

Circularity Check

0 steps flagged

No circularity: ZOMA unification is definitional by framework construction; convergence rates presented as derived, not reduced to self-citations or fits.

full rationale

The provided abstract defines ZOMA explicitly to subsume hybrid estimators, GT/ED/EXTRA bias corrections, and STORM/PAGE/L2S accelerations, then states that this leads to unified convergence guarantees matching centralized ZO methods. No equations appear that would make any claimed rate equivalent to its inputs by construction, no self-citations are referenced as load-bearing, and no fitted parameters are renamed as predictions. The derivation chain is therefore self-contained; the unification is an explicit design choice whose correctness is to be verified externally rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or additional axioms are visible beyond the problem class stated in the abstract.

axioms (1)
  • domain assumption The target problem satisfies the nonconvex Polyak-Łojasiewicz condition for minimax optimization.
    Explicitly named in the abstract as the setting in which ZOMA operates.

pith-pipeline@v0.9.1-grok · 5806 in / 1344 out tokens · 17072 ms · 2026-06-27T08:45:23.881138+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 3 linked inside Pith

  1. [1]

    Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,

    P.-Y . Chen, H. Zhang, Y . Sharma, J. Yi, and C.-J. Hsieh, “Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” inACM Workshop on Artificial Intelligence and Security, 2017, pp. 15–26

  2. [2]

    Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks,

    C.-C. Tu, P. Ting, P.-Y . Chen, S. Liu, H. Zhang, J. Yi, C.-J. Hsieh, and S.-M. Cheng, “Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks,” inAAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 742–749

  3. [3]

    Fine-tuning language models with just forward passes,

    S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora, “Fine-tuning language models with just forward passes,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 53 038–53 075. 10 Fig. 4. Simulation results on training a fair classifier with the FashionMNIST dataset. We report the test accuracy of thehybridZO va...

  4. [4]

    Revisiting zeroth- order optimization for memory-efficient llm fine-tuning: A benchmark,

    Y . Zhang, P. Li, J. Hong, J. Li, Y . Zhang, W. Zheng, P.-Y . Chen, J. D. Lee, W. Yin, M. Hong, Z. Wang, S. Liu, and T. Chen, “Revisiting zeroth- order optimization for memory-efficient llm fine-tuning: A benchmark,” arXiv:2402.11592, 2024

  5. [5]

    A decentralized parallel algorithm for training generative adversarial nets,

    M. Liu, W. Zhang, Y . Mroueh, X. Cui, J. Ross, T. Yang, and P. Das, “A decentralized parallel algorithm for training generative adversarial nets,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 11 056–11 070

  6. [6]

    Communication-efficient distributed stochastic auc maximization with deep neural networks,

    Z. Guo, M. Liu, Z. Yuan, L. Shen, W. Liu, and T. Yang, “Communication-efficient distributed stochastic auc maximization with deep neural networks,” inInternational Conference on Machine Learn- ing, 2020, pp. 3864–3874

  7. [7]

    Robust multi- agent reinforcement learning via minimax deep deterministic policy gradient,

    S. Li, Y . Wu, X. Cui, H. Dong, F. Fang, and S. Russell, “Robust multi- agent reinforcement learning via minimax deep deterministic policy gradient,” inAAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 4213–4220

  8. [8]

    Improved zeroth-order vari- ance reduced algorithms and analysis for nonconvex optimization,

    K. Ji, Z. Wang, Y . Zhou, and Y . Liang, “Improved zeroth-order vari- ance reduced algorithms and analysis for nonconvex optimization,” in International Conference on Machine Learning, 2019, pp. 3100–3109

  9. [9]

    Min-max optimization without gradients: Convergence and applications to black-box evasion and poisoning attacks,

    S. Liu, S. Lu, X. Chen, Y . Feng, K. Xu, A. Al-Dujaili, M. Hong, and U.-M. O’Reilly, “Min-max optimization without gradients: Convergence and applications to black-box evasion and poisoning attacks,” inInter- national Conference on Machine Learning, 2020, pp. 6282–6293

  10. [10]

    Zeroth-order stochastic variance reduction for nonconvex optimization,

    S. Liu, B. Kailkhura, P.-Y . Chen, P. Ting, S. Chang, and L. Amini, “Zeroth-order stochastic variance reduction for nonconvex optimization,” inAdvances in Neural Information Processing Systems, vol. 31, 2018, pp. 3731–3741

  11. [11]

    On gradient descent ascent for nonconvex-concave minimax problems,

    T. Lin, C. Jin, and M. I. Jordan, “On gradient descent ascent for nonconvex-concave minimax problems,” inInternational Conference on Machine Learning, 2020, pp. 6083–6093

  12. [12]

    Efficient methods for structured nonconvex-nonconcave min-max optimization,

    J. Diakonikolas, C. Daskalakis, and M. I. Jordan, “Efficient methods for structured nonconvex-nonconcave min-max optimization,” inInter- national Conference on Artificial Intelligence and Statistics, 2021, pp. 2746–2754

  13. [13]

    A unified single-loop alternat- ing gradient projection algorithm for nonconvex–concave and convex– nonconcave minimax problems,

    Z. Xu, H. Zhang, Y . Xu, and G. Lan, “A unified single-loop alternat- ing gradient projection algorithm for nonconvex–concave and convex– nonconcave minimax problems,”Mathematical Programming, vol. 201, no. 1, pp. 635–706, 2023

  14. [14]

    Faster single-loop algo- rithms for minimax optimization without strong concavity,

    J. Yang, A. Orvieto, A. Lucchi, and N. He, “Faster single-loop algo- rithms for minimax optimization without strong concavity,” inInter- national Conference on Artificial Intelligence and Statistics, 2022, pp. 5485–5517

  15. [15]

    Universal gradient descent ascent method for nonconvex-nonconcave minimax optimization,

    T. Zheng, L. Zhu, A. M.-C. So, J. Blanchet, and J. Li, “Universal gradient descent ascent method for nonconvex-nonconcave minimax optimization,” inAdvances in Neural Information Processing Systems, vol. 36, 2023, pp. 54 075–54 110

  16. [16]

    Accelerated stochas- tic min-max optimization based on bias-corrected momentum,

    H. Cai, S. A. Alghunaim, and A. H. Sayed, “Accelerated stochas- tic min-max optimization based on bias-corrected momentum,” arXiv:2406.13041, 2024

  17. [17]

    Enhanced adaptive gradient algorithms for nonconvex-pl minimax optimization,

    F. Huang, C. Xuan, X. Wang, S. Zhang, and S. Chen, “Enhanced adaptive gradient algorithms for nonconvex-pl minimax optimization,” in International Conference on Artificial Intelligence and Statistics, 2025, pp. 3439–3447

  18. [18]

    Two-timescale gradient descent ascent algorithms for nonconvex minimax optimization,

    T. Lin, C. Jin, and M. I. Jordan, “Two-timescale gradient descent ascent algorithms for nonconvex minimax optimization,”Journal of Machine Learning Research, vol. 26, no. 11, pp. 1–45, 2025

  19. [19]

    Tiada: A time-scale adaptive algorithm for nonconvex minimax optimization,

    X. Li, J. Yang, and N. He, “Tiada: A time-scale adaptive algorithm for nonconvex minimax optimization,”arXiv:2210.17478, 2022

  20. [20]

    A faster decentralized algorithm for nonconvex minimax problems,

    W. Xian, F. Huang, Y . Zhang, and H. Huang, “A faster decentralized algorithm for nonconvex minimax problems,” inAdvances in Neural Information Processing Systems, vol. 34, 2021, pp. 25 865–25 877

  21. [21]

    Decentralized stochastic gradient descent ascent for finite-sum minimax problems,

    H. Gao, “Decentralized stochastic gradient descent ascent for finite-sum minimax problems,”arXiv:2212.02724, 2022

  22. [22]

    Variance-reduced accelerated methods for decentralized stochastic double-regularized nonconvex strongly-concave minimax problems,

    G. Mancino-Ball and Y . Xu, “Variance-reduced accelerated methods for decentralized stochastic double-regularized nonconvex strongly-concave minimax problems,”arXiv:2307.07113, 2023

  23. [23]

    Jointly improving the sample and communication complexities in decentralized stochastic minimax optimization,

    X. Zhang, G. Mancino-Ball, N. S. Aybat, and Y . Xu, “Jointly improving the sample and communication complexities in decentralized stochastic minimax optimization,” inAAAI Conference on Artificial Intelligence, vol. 38, no. 18, 2024, pp. 20 865–20 873

  24. [24]

    Decentralized gradient descent maximization method for com- posite nonconvex strongly-concave minimax problems,

    Y . Xu, “Decentralized gradient descent maximization method for com- posite nonconvex strongly-concave minimax problems,”SIAM Journal on Optimization, vol. 34, no. 1, pp. 1006–1044, 2024

  25. [25]

    Communication-efficient algorithms for distributed nonconvex minimax optimization problems,

    H. Cai, S. A. Alghunaim, and A. H. Sayed, “Communication-efficient algorithms for distributed nonconvex minimax optimization problems,” arXiv:2507.21901, 2025

  26. [26]

    DAMA: A unified accelerated approach for decentralized non- convex minimax optimization-part I: Algorithm development and re- sults,

    ——, “DAMA: A unified accelerated approach for decentralized non- convex minimax optimization-part I: Algorithm development and re- sults,”arXiv:2512.13920, 2025

  27. [27]

    DAMA: A unified accelerated approach for decentralized non- convex minimax optimization-part II: Convergence and performance analyses,

    ——, “DAMA: A unified accelerated approach for decentralized non- convex minimax optimization-part II: Convergence and performance analyses,”arXiv:2512.13923, 2025

  28. [28]

    Stochastic first-and zeroth-order methods for nonconvex stochastic programming,

    S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,”SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013

  29. [29]

    Random gradient-free minimization of convex functions,

    Y . Nesterov and V . Spokoiny, “Random gradient-free minimization of convex functions,”Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, 2017

  30. [30]

    On the information-adaptive variants of admm: An iteration complexity perspective,

    X. Gao, B. Jiang, and S. Zhang, “On the information-adaptive variants of admm: An iteration complexity perspective,”Journal of Scientific Computing, vol. 76, no. 1, pp. 327–363, 2018

  31. [31]

    Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications,

    S. Liu, J. Chen, P.-Y . Chen, and A. Hero, “Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications,” inInternational Conference on Artificial Intelligence and Statistics, 2018, pp. 288–297

  32. [32]

    A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications,

    S. Liu, P.-Y . Chen, B. Kailkhura, G. Zhang, A. O. Hero, and P. K. Varshney, “A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications,” IEEE Signal Processing Magazine, vol. 37, no. 5, pp. 43–54, 2020

  33. [33]

    A zeroth-order variance-reduced method for decentralized stochastic non-convex optimization,

    H. Chen, J. Chen, and K. Wei, “A zeroth-order variance-reduced method for decentralized stochastic non-convex optimization,”Optimization, pp. 1–43, 2025

  34. [34]

    Zeroth-order algorithms for stochastic distributed nonconvex optimization,

    X. Yi, S. Zhang, T. Yang, and K. H. Johansson, “Zeroth-order algorithms for stochastic distributed nonconvex optimization,”Automatica, vol. 142, p. 110353, 2022

  35. [35]

    Distributed zero-order algorithms for nonconvex multiagent optimization,

    Y . Tang, J. Zhang, and N. Li, “Distributed zero-order algorithms for nonconvex multiagent optimization,”IEEE Transactions on Control of Network Systems, vol. 8, no. 1, pp. 269–281, 2020

  36. [36]

    Distributed zeroth or- der optimization over random networks: A Kiefer-Wolfowitz stochastic approximation approach,

    A. K. Sahu, D. Jakovetic, D. Bajovic, and S. Kar, “Distributed zeroth or- der optimization over random networks: A Kiefer-Wolfowitz stochastic approximation approach,” inIEEE Conference on Decision and Control. IEEE, 2018, pp. 4951–4958

  37. [37]

    Decentralized zeroth-order constrained stochas- tic optimization algorithms: Frank–Wolfe and variants with applications to black-box adversarial attacks,

    A. K. Sahu and S. Kar, “Decentralized zeroth-order constrained stochas- tic optimization algorithms: Frank–Wolfe and variants with applications to black-box adversarial attacks,”Proceedings of the IEEE, vol. 108, no. 11, pp. 1890–1905, 2020

  38. [38]

    A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems,

    J. Zhang, P. Xiao, R. Sun, and Z. Luo, “A single-loop smoothed gradient descent-ascent algorithm for nonconvex-concave min-max problems,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 7377–7389

  39. [39]

    Stochastic recursive gradi- ent descent ascent for stochastic nonconvex-strongly-concave minimax problems,

    L. Luo, H. Ye, Z. Huang, and T. Zhang, “Stochastic recursive gradi- ent descent ascent for stochastic nonconvex-strongly-concave minimax problems,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 20 566–20 577. 11

  40. [40]

    Zeroth- order algorithms for nonconvex–strongly-concave minimax problems with improved complexities,

    Z. Wang, K. Balasubramanian, S. Ma, and M. Razaviyayn, “Zeroth- order algorithms for nonconvex–strongly-concave minimax problems with improved complexities,”Journal of Global Optimization, vol. 87, no. 2, pp. 709–740, 2023

  41. [41]

    Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization,

    F. Huang, S. Gao, J. Pei, and H. Huang, “Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization,” Journal of Machine Learning Research, vol. 23, no. 36, pp. 1–70, 2022

  42. [42]

    Gradient free minimax optimization: Variance reduction and faster convergence,

    T. Xu, Z. Wang, Y . Liang, and H. V . Poor, “Gradient free minimax optimization: Variance reduction and faster convergence,” arXiv:2006.09361, 2020

  43. [43]

    Zeroth-order alternating gradient descent ascent algorithms for a class of nonconvex-nonconcave minimax problems,

    Z. Xu, Z.-Q. Wang, J.-L. Wang, and Y .-H. Dai, “Zeroth-order alternating gradient descent ascent algorithms for a class of nonconvex-nonconcave minimax problems,”Journal of Machine Learning Research, vol. 24, no. 313, pp. 1–25, 2023

  44. [44]

    Robust and faster zeroth-order minimax optimization: complexity and applications,

    W. An, Y . Liu, F. Shang, and H. Liu, “Robust and faster zeroth-order minimax optimization: complexity and applications,” inAdvances in Neural Information Processing Systems, vol. 37, 2024, pp. 37 050– 37 069

  45. [45]

    Near-optimal decentralized momentum method for nonconvex-pl minimax problems,

    F. Huang and S. Chen, “Near-optimal decentralized momentum method for nonconvex-pl minimax problems,”arXiv:2304.10902, 2023

  46. [46]

    Federated stochastic minimax optimization under heavy-tailed noises,

    X. Zhang and H. Gao, “Federated stochastic minimax optimization under heavy-tailed noises,”arXiv:2511.04456, 2025

  47. [47]

    A unified and refined convergence analysis for non-convex decentralized learning,

    S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,”IEEE Transactions on Signal Processing, vol. 70, pp. 3264–3279, 2022

  48. [48]

    A. H. Sayed,Inference and Learning from Data. Cambridge University Press, 2022

  49. [49]

    An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization,

    L. Chen, H. Ye, and L. Luo, “An efficient stochastic algorithm for decentralized nonconvex-strongly-concave minimax optimization,” in International Conference on Artificial Intelligence and Statistics, 2024, pp. 1990–1998

  50. [50]

    Momentum-based variance reduction in non-convex sgd,

    A. Cutkosky and F. Orabona, “Momentum-based variance reduction in non-convex sgd,” inAdvances in Neural Information Processing Systems, vol. 32, 2019, pp. 15 236 – 15 245

  51. [51]

    Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization,

    Z. Li, H. Bao, X. Zhang, and P. Richt ´arik, “Page: A simple and optimal probabilistic gradient estimator for nonconvex optimization,” in International Conference on Machine Learning, 2021, pp. 6286–6295

  52. [52]

    On the convergence of sarah and beyond,

    B. Li, M. Ma, and G. B. Giannakis, “On the convergence of sarah and beyond,” inInternational Conference on Artificial Intelligence and Statistics, 2020, pp. 223–233

  53. [53]

    Achieving geometric convergence for distributed optimization over time-varying graphs,

    A. Nedi ´c, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,”SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017

  54. [54]

    Exact diffusion for distributed optimization and learning—part I: Algorithm development,

    K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, “Exact diffusion for distributed optimization and learning—part I: Algorithm development,” IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 708–723, 2018

  55. [55]

    Extra: An exact first-order algorithm for decentralized consensus optimization,

    W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact first-order algorithm for decentralized consensus optimization,”SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015

  56. [56]

    Solving a class of non-convex min-max games using iterative first order methods,

    M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn, “Solving a class of non-convex min-max games using iterative first order methods,” inAdvances in Neural Information Processing Systems, vol. 32, 2019, pp. 14 905–14 916

  57. [57]

    Loss landscapes and optimization in over-parameterized non-linear systems and neural networks,

    C. Liu, L. Zhu, and M. Belkin, “Loss landscapes and optimization in over-parameterized non-linear systems and neural networks,”Applied and Computational Harmonic Analysis, vol. 59, pp. 85–116, 2022

  58. [58]

    Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems,

    J. Yang, N. Kiyavash, and N. He, “Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1153–1165

  59. [59]

    Agnostic federated learning,

    M. Mohri, G. Sivek, and A. T. Suresh, “Agnostic federated learning,” in International Conference on Machine Learning, 2019, pp. 4615–4625

  60. [60]

    Solving a class of non-convex minimax optimization in federated learning,

    X. Wu, J. Sun, Z. Hu, A. Zhang, and H. Huang, “Solving a class of non-convex minimax optimization in federated learning,” inAdvances in Neural Information Processing Systems, vol. 36, 2024

  61. [61]

    Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,

    H. Xiao, K. Rasul, and R. V ollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv:1708.07747, 2017. 12 APPENDIXA NOTATION To facilitate analysis, we first introduce the concatenated variables, i.e., z≜[x;y],z≜[x;y]∈R d1+d2 .(18) The averaged variable and gradient estimator are defined as zc,i ≜[x c,i;y c,i]≜ 1...

  62. [62]

    For convenience, let R≜Q k(x+µ xu, y;ξ)−Q k(z;ξ)− ⟨∇ xQk(z;ξ), µ xu⟩,(144) whereu,ξare independent ofz

    Note that E qx i (zk,i;ξ j k,i,u j k,i)−q x i (zk,i−1;ξ j k,i,u j k,i) 2 =E d1(Qk(xk,i +µ xuj k,i,y k,i;ξ j k,i)−Q k(zk,i;ξ j k,i)) µx uj k,i − d1(Qk(xk,i−1 +µ xuj k,i,y k,i−1;ξ j k,i)−Q k(zk,i−1;ξ j k,i)) µx uj k,i 2 (a) =d 2 1E Qk(xk,i +µ xuj k,i,y k,i;ξ j k,i)−Q k(zk,i;ξ j k,i)− ⟨∇ xQk(zk,i;ξ j k,i), µxuj k,i⟩ µx uj k,i − Qk(xk,i−1 +µ xuj k,i,y k,i−1;ξ...

  63. [63]

    (181) Proof.According to Lemma 1, given aL-smooth functionP(z), thenP µx(x)≜E u∼Unil(Bd1)P(x+µ xu)isL ′(≤L)-smooth, it follows that Pµx(xc,i+1)≤P µx(xc,i) +⟨∇P µx(xc,i),x c,i+1 −x c,i⟩+ Lη2 x 2 ∥gx c,i∥2 ≤P µx(xc,i)−η x⟨∇Pµx(xc,i),g x c,i⟩+ Lη2 x 2 ∥gx c,i∥2 ≤P µx(xc,i)− ηx 2 ∥∇Pµx(xc,i)∥2 − ηx 2 ∥gx c,i∥2 + ηx 2 ∥∇Pµx(xc,i)−g x c,i∥2 + Lη2 x 2 ∥gx c,i∥2....