pith. machine review for the scientific record. sign in

arxiv: 2605.03400 · v1 · submitted 2026-05-05 · 🧮 math.OC

Recognition: unknown

A Quadratic-Approximation-Based Stochastic Approximation Method for Weakly Convex Stochastic Programming

Authors on Pith no claims yet

Pith reviewed 2026-05-07 15:49 UTC · model grok-4.3

classification 🧮 math.OC
keywords stochastic approximationweakly convex optimizationproximal method of multipliersquadratic approximationKKT conditionsconvergence ratesstochastic programmingsample complexity
0
0 comments X

The pith

PMQSopt integrates proximal multipliers with quadratic approximations to reach O(T^{-1/4}) expected convergence on KKT metrics for weakly convex stochastic programs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PMQSopt, a stochastic approximation algorithm that combines the proximal method of multipliers with quadratic approximations of the original stochastic problem to solve weakly convex stochastic optimization problems involving expectation-valued functions. It proves that after T iterations the expected squared norm of the gradient of the Moreau envelope of the Lagrangian, the expected constraint violation, and the expected complementarity violation each converge at rate O(T^{-1/4}), with accompanying high-probability bounds that are slightly weaker. These guarantees hold under the two stated conditions of weak convexity of the problem functions and existence of a strictly feasible point, and the method is shown to be sequentially strongly convex and directly implementable.

Core claim

PMQSopt is constructed by integrating the proximal method of multipliers with quadratic approximations of the original stochastic problem. The convergence of the algorithm is characterized by three metrics associated with the epsilon-KKT conditions: the average squared norm of the gradient of the Moreau envelope of the Lagrangian, the average constraint violation, and the average complementarity violation. For each of these metrics the paper establishes an expected convergence rate of O(T^{-1/4}) after T iterations, together with explicit high-probability bounds of O(T^{-1/8}) or O(T^{-1/4}) that hold with probability at least 1 minus a small power of 1/T.

What carries the argument

The proximal method of multipliers combined with quadratic approximations of the stochastic objective and constraints, which produces a sequence of strongly convex subproblems whose solutions are used to update the multipliers and the quadratic model.

If this is right

  • The total number of stochastic gradient evaluations required to reach a given accuracy is characterized directly by the iteration count T.
  • The method produces iterates that satisfy high-probability bounds on the Lagrangian gradient, constraint violation, and complementarity violation simultaneously.
  • The algorithm remains applicable to constrained stochastic programs whose functions satisfy only weak convexity rather than strong convexity or smoothness.
  • Numerical performance is illustrated on test problems, confirming that the method can be implemented and run without additional tuning beyond the stated assumptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The quadratic-approximation step may allow reuse of existing convex solvers for the subproblems, lowering the barrier to practical deployment in large-scale settings.
  • The high-probability bounds suggest the method could be paired with variance-reduction or mini-batch techniques to tighten the probabilistic guarantees without changing the core iteration.
  • Because the analysis separates the handling of the Lagrangian gradient from the feasibility metrics, the same framework could be adapted to other penalty or augmented-Lagrangian schemes for nonconvex stochastic programs.

Load-bearing premise

All convergence claims rest on the assumption that every problem function is weakly convex and that a strictly feasible point exists.

What would settle it

Construct a weakly convex stochastic program with a strictly feasible point, run PMQSopt for large T, and observe that any of the three KKT metrics fails to improve at rate O(T^{-1/4}) in expectation.

Figures

Figures reproduced from arXiv: 2605.03400 by Benqi Liu, Liwei Zhang, Xiantao Xiao, Yule Zhang.

Figure 1
Figure 1. Figure 1: Objective value and constraint violation versus the number of stochastic gradients on the view at source ↗
Figure 2
Figure 2. Figure 2: Empirical decay of the theory-driven residuals for PMQSopt on the synthetic QCNP. view at source ↗
Figure 3
Figure 3. Figure 3: Objective value and constraint violation versus the number of stochastic gradients on view at source ↗
Figure 4
Figure 4. Figure 4: Objective value and constraint violation versus the number of stochastic gradients on view at source ↗
Figure 5
Figure 5. Figure 5: Objective value and constraint violation versus the number of stochastic gradients on view at source ↗
Figure 6
Figure 6. Figure 6: Objective value and fairness constraint violation versus the number of stochastic gradients view at source ↗
Figure 7
Figure 7. Figure 7: Objective value and fairness constraint violation versus the number of stochastic gradients view at source ↗
Figure 8
Figure 8. Figure 8: Objective value and fairness constraint violation versus the number of stochastic gradients view at source ↗
read the original abstract

We propose a novel stochastic approximation algorithm, termed PMQSopt, for solving weakly convex stochastic optimization problems involving expectation-valued functions. The algorithm is constructed by integrating the proximal method of multipliers with quadratic approximations of the original stochastic problem. We analyze the sample complexity of PMQSopt in terms of the total number of stochastic gradient evaluations required. The convergence of the algorithm is characterized by three metrics associated with the $\epsilon$-KKT conditions: the average squared norm of the gradient of the Moreau envelope of the Lagrangian, the average constraint violation, and the average complementarity violation. For each of these metrics, we establish an expected convergence rate of $\mathcal{O}(T^{-1/4})$ after $T$ iterations. Furthermore, we show that with probability at least $1-1/T^{2/3}$, the gradient of the Lagrangian satisfies an $\mathcal{O}(T^{-1/8})$ bound; with probability at least $1-2/T^{2/3}$, the constraint violation achieves an $\mathcal{O}(T^{-1/4})$ bound; and with probability at least $1-3/T^{2/3}$, the complementarity violation attains an $\mathcal{O}(T^{-1/4})$ bound. All results are established under two mild conditions: (i) weak convexity of all problem functions, and (ii) the existence of a strictly feasible point. The proposed PMQSopt algorithm is a sequentially strongly convex programming method that is readily implementable. Numerical experiments illustrate its practical performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes the PMQSopt algorithm for weakly convex stochastic programming problems involving expectation-valued functions. It combines the proximal method of multipliers with quadratic approximations and analyzes sample complexity via three KKT metrics: the average squared norm of the gradient of the Moreau envelope of the Lagrangian, average constraint violation, and average complementarity violation. Expected rates of O(T^{-1/4}) are claimed for all three after T iterations, with high-probability bounds of O(T^{-1/8}) for the Lagrangian gradient (prob. 1-1/T^{2/3}) and O(T^{-1/4}) for the violations (probs. 1-2/T^{2/3} and 1-3/T^{2/3}), under weak convexity and existence of a strictly feasible point.

Significance. If the rates hold, the work provides a new, implementable stochastic approximation method for constrained weakly convex problems under mild assumptions, filling a gap between convex SA methods and general nonconvex ones. The dual expected and high-probability guarantees, plus the sequentially strongly convex subproblem structure, add both theoretical and practical value for applications in machine learning and optimization.

major comments (1)
  1. [Abstract] Abstract: The high-probability bound for the gradient of the Lagrangian is O(T^{-1/8}) with probability 1-1/T^{2/3}, while constraint and complementarity violations achieve the faster O(T^{-1/4}) with probabilities 1-2/T^{2/3} and 1-3/T^{2/3}. This selective rate degradation for the Moreau-envelope gradient metric is not explained. If the variance/martingale assumptions are identical across metrics, the analysis appears to apply a weaker concentration step or different parameter tuning for the gradient bound; the full proof should clarify the source of the looseness or tighten it to O(T^{-1/4}).
minor comments (2)
  1. [Abstract] The abstract states that numerical experiments illustrate practical performance but provides no details on test problems, dimensions, or baselines; adding a brief description would improve clarity.
  2. Ensure consistent use of mathcal{O} notation and explicit statement of all probability exponents throughout the manuscript.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. We address the major comment point by point below.

read point-by-point responses
  1. Referee: The high-probability bound for the gradient of the Lagrangian is O(T^{-1/8}) with probability 1-1/T^{2/3}, while constraint and complementarity violations achieve the faster O(T^{-1/4}) with probabilities 1-2/T^{2/3} and 1-3/T^{2/3}. This selective rate degradation for the Moreau-envelope gradient metric is not explained. If the variance/martingale assumptions are identical across metrics, the analysis appears to apply a weaker concentration step or different parameter tuning for the gradient bound; the full proof should clarify the source of the looseness or tighten it to O(T^{-1/4}).

    Authors: We agree the abstract and convergence theorem statement do not explain the rate difference. The O(T^{-1/8}) high-probability bound for the Moreau-envelope gradient arises because this metric requires controlling a recursive sequence of quadratic approximation errors inside the proximal multiplier updates; we apply Freedman's martingale inequality to the associated stochastic gradient noise, and optimizing the deviation parameter for failure probability 1/T^{2/3} produces the square-root degradation. Constraint and complementarity violations are direct non-negative averages that concentrate via a standard Hoeffding bound without the recursive layer, preserving O(T^{-1/4}). The differing probabilities (1-1/T^{2/3} vs. 1-2/T^{2/3}, 1-3/T^{2/3}) follow from the number of union-bound events in each proof. Under only weak convexity and strict feasibility we cannot tighten the gradient bound to O(T^{-1/4}) without stronger variance or boundedness assumptions that would restrict applicability. We will revise by adding a clarifying paragraph after Theorem 3.2 and a remark in the appendix proof. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation; algorithm and rates derived independently from assumptions

full rationale

The paper introduces PMQSopt by combining the proximal method of multipliers with quadratic approximations of the stochastic problem. Expected O(T^{-1/4}) rates for the three KKT metrics (Moreau-envelope gradient norm, constraint violation, complementarity violation) and the high-probability bounds are obtained by direct analysis of the algorithm iterates under the stated assumptions of weak convexity and existence of a strictly feasible point. No self-definitional steps appear, no fitted parameters are relabeled as predictions, and no load-bearing claims reduce to self-citations or prior ansatzes by the same authors. The derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The convergence claims rest on two domain assumptions that are not derived inside the paper.

axioms (2)
  • domain assumption weak convexity of all problem functions
    Invoked as condition (i) to obtain the stated rates
  • domain assumption existence of a strictly feasible point
    Invoked as condition (ii) to obtain the stated rates

pith-pipeline@v0.9.0 · 5579 in / 1296 out tokens · 76480 ms · 2026-05-07T15:49:38.419946+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 2 canonical work pages

  1. [1]

    Beck,First-Order Methods in Optimization, SIAM, Philadelphia, 2017

    A. Beck,First-Order Methods in Optimization, SIAM, Philadelphia, 2017. 37

  2. [2]

    D. Boob, Q. Deng and G. Lan, Stochastic first-order methods for convex and nonconvex func- tional constrained optimization,Mathematical Programming, 197 (2023), 215–279

  3. [3]

    K. L. Chung, On a stochastic approximation method,Annals of Mathematical Statistics, 25 (1954), 463–483

  4. [4]

    Workbench Team C, A Marketing Dataset, 2008.http://www.causality.inf.ethz.ch/ data/CINA.html

  5. [5]

    F. E. Curtis, M. J. O’Neill and D. P. Robinson, Worst-case complexity of an SQP method for nonlinear equality constrained stochastic optimization,Mathematical Programming, 205 (2024), 431–483

  6. [6]

    Davis and D

    D. Davis and D. Drusvyatskiy, Stochastic model-based minimization of weakly convex func- tions,SIAM Journal on Optimization, 29:1(2019), 207–239

  7. [7]

    Drusvyatskiy and A

    D. Drusvyatskiy and A. S. Lewis, Error bounds, quadratic growth, and linear convergence of proximal methods,Mathematics of Operations Research, 43:3 (2018), 919–948

  8. [8]

    Ghadimi and G

    S. Ghadimi and G. Lan, Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework,SIAM Journal on Op- timization, 22 (2012), 1469–1492

  9. [9]

    Ghadimi and G

    S. Ghadimi and G. Lan, Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms,SIAM Journal on Optimization, 23:4 (2013), 2061–2089

  10. [10]

    Ghadimi and G

    S. Ghadimi and G. Lan, Stochastic first-and zeroth-order methods for nonconvex stochastic programming,SIAM Journal on Optimization, 23:4 (2013), 2341–2368

  11. [11]

    Ghadimi, G

    S. Ghadimi, G. Lan and H. Zhang, Mini-batch stochastic approximation methods for noncon- vex stochastic composite optimization,Mathematical Programming, 155:1 (2016), 267–305

  12. [12]

    Guyon, S

    I. Guyon, S. Gunn, A. Ben-Hur and G. Dror, Result analysis of the NIPS 2003 feature selection challenge,Advances in Neural Information Processing Systems, V ol. 17, MIT Press, 2004, pp. 545–552

  13. [13]

    Jin and X

    L. Jin and X. Wang, A stochastic primal-dual method for a class of nonconvex constrained optimization,Computational Optimization and Applications, 83 (2022), 143–180

  14. [14]

    Lan,First-Order and Stochastic Optimization Methods for Machine Learning, Springer Series in the Data Sciences, Springer, Cham, Switzerland, 2020

    G. Lan,First-Order and Stochastic Optimization Methods for Machine Learning, Springer Series in the Data Sciences, Springer, Cham, Switzerland, 2020

  15. [15]

    Lan, An optimal method for stochastic composite optimization,Mathematical Program- ming, 133:1 (2012), 365–397

    G. Lan, An optimal method for stochastic composite optimization,Mathematical Program- ming, 133:1 (2012), 365–397

  16. [16]

    G. Lan, A. Nemirovski and A. Shapiro, Validation analysis of mirror descent stochastic ap- proximation method,Mathematical Programming, 134:2 (2012), 425–458

  17. [17]

    Lan and Z

    G. Lan and Z. Zhou, Algorithms for stochastic optimization with expectation constraints,Com- putational Optimization and Applications, 76 (2020), 461–498. 38

  18. [18]

    LeCun, C

    Y . LeCun, C. Cortes and C. J. C. Burges, The MNIST Database of Handwritten Digits, 2010. http://yann.lecun.com/exdb/mnist/

  19. [19]

    Li, P.-Y

    Z. Li, P.-Y . Chen, S. Liu, S. Lu and Y . Xu, Stochastic inexact augmented Lagrangian method for nonconvex expectation constrained optimization,Computational Optimization and Appli- cations, 87 (2024), 117–147

  20. [20]

    Nemirovski, A

    A. Nemirovski, A. Juditsky, G. Lan and A. Shapiro, Robust stochastic approximation approach to stochastic programming,SIAM Journal on Optimization, 19 (2009), 1574–1609

  21. [21]

    B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging,SIAM Journal on Control and Optimization, 30 (1992), 838–855

  22. [22]

    Robbins and S

    H. Robbins and S. Monro, A stochastic approximation method,Annals of Mathematical Statis- tics, 22 (1951), 400–407

  23. [23]

    R. T. Rockafellar,Convex Analysis, Princeton University Press, Princeton, 1970

  24. [24]

    Sacks, Asymptotic distribution of stochastic approximation,Annals of Mathematical Statis- tics, 29 (1958), 373–409

    J. Sacks, Asymptotic distribution of stochastic approximation,Annals of Mathematical Statis- tics, 29 (1958), 373–409

  25. [25]

    Shalev-Shwartz, Online learning and online convex optimization,Foundations and Trends in Machine Learning, 4:2 (2011), 107–194

    S. Shalev-Shwartz, Online learning and online convex optimization,Foundations and Trends in Machine Learning, 4:2 (2011), 107–194

  26. [26]

    Q. Shi, X. Wang and H. Wang, A momentum-based linearized augmented Lagrangian method for nonconvex constrained stochastic optimization,Mathematics of Operations Research, pub- lished online, 23 Jan 2025,https://doi.org/10.1287/moor.2022.0193

  27. [27]

    Xiao, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research, 11 (2010), 2543–2596

    L. Xiao, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research, 11 (2010), 2543–2596

  28. [28]

    Xu, Primal-dual stochastic gradient method for convex programs with many functional constraints,SIAM Journal on Optimization, 30:2 (2020), 1664–1692

    Y . Xu, Primal-dual stochastic gradient method for convex programs with many functional constraints,SIAM Journal on Optimization, 30:2 (2020), 1664–1692

  29. [29]

    H. Yu, M. J. Neely and X. Wei, Online convex optimization with stochastic constraints, arXiv:1708.03741v1, 2017

  30. [30]

    Zhang, Y

    L. Zhang, Y . Zhang, X. Xiao and J. Wu, Stochastic approximation proximal method of multi- pliers for convex stochastic programming,Mathematics of Operations Research, 48:1 (2023), 177–193. 39