Recognition: unknown
A Quadratic-Approximation-Based Stochastic Approximation Method for Weakly Convex Stochastic Programming
Pith reviewed 2026-05-07 15:49 UTC · model grok-4.3
The pith
PMQSopt integrates proximal multipliers with quadratic approximations to reach O(T^{-1/4}) expected convergence on KKT metrics for weakly convex stochastic programs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PMQSopt is constructed by integrating the proximal method of multipliers with quadratic approximations of the original stochastic problem. The convergence of the algorithm is characterized by three metrics associated with the epsilon-KKT conditions: the average squared norm of the gradient of the Moreau envelope of the Lagrangian, the average constraint violation, and the average complementarity violation. For each of these metrics the paper establishes an expected convergence rate of O(T^{-1/4}) after T iterations, together with explicit high-probability bounds of O(T^{-1/8}) or O(T^{-1/4}) that hold with probability at least 1 minus a small power of 1/T.
What carries the argument
The proximal method of multipliers combined with quadratic approximations of the stochastic objective and constraints, which produces a sequence of strongly convex subproblems whose solutions are used to update the multipliers and the quadratic model.
If this is right
- The total number of stochastic gradient evaluations required to reach a given accuracy is characterized directly by the iteration count T.
- The method produces iterates that satisfy high-probability bounds on the Lagrangian gradient, constraint violation, and complementarity violation simultaneously.
- The algorithm remains applicable to constrained stochastic programs whose functions satisfy only weak convexity rather than strong convexity or smoothness.
- Numerical performance is illustrated on test problems, confirming that the method can be implemented and run without additional tuning beyond the stated assumptions.
Where Pith is reading between the lines
- The quadratic-approximation step may allow reuse of existing convex solvers for the subproblems, lowering the barrier to practical deployment in large-scale settings.
- The high-probability bounds suggest the method could be paired with variance-reduction or mini-batch techniques to tighten the probabilistic guarantees without changing the core iteration.
- Because the analysis separates the handling of the Lagrangian gradient from the feasibility metrics, the same framework could be adapted to other penalty or augmented-Lagrangian schemes for nonconvex stochastic programs.
Load-bearing premise
All convergence claims rest on the assumption that every problem function is weakly convex and that a strictly feasible point exists.
What would settle it
Construct a weakly convex stochastic program with a strictly feasible point, run PMQSopt for large T, and observe that any of the three KKT metrics fails to improve at rate O(T^{-1/4}) in expectation.
Figures
read the original abstract
We propose a novel stochastic approximation algorithm, termed PMQSopt, for solving weakly convex stochastic optimization problems involving expectation-valued functions. The algorithm is constructed by integrating the proximal method of multipliers with quadratic approximations of the original stochastic problem. We analyze the sample complexity of PMQSopt in terms of the total number of stochastic gradient evaluations required. The convergence of the algorithm is characterized by three metrics associated with the $\epsilon$-KKT conditions: the average squared norm of the gradient of the Moreau envelope of the Lagrangian, the average constraint violation, and the average complementarity violation. For each of these metrics, we establish an expected convergence rate of $\mathcal{O}(T^{-1/4})$ after $T$ iterations. Furthermore, we show that with probability at least $1-1/T^{2/3}$, the gradient of the Lagrangian satisfies an $\mathcal{O}(T^{-1/8})$ bound; with probability at least $1-2/T^{2/3}$, the constraint violation achieves an $\mathcal{O}(T^{-1/4})$ bound; and with probability at least $1-3/T^{2/3}$, the complementarity violation attains an $\mathcal{O}(T^{-1/4})$ bound. All results are established under two mild conditions: (i) weak convexity of all problem functions, and (ii) the existence of a strictly feasible point. The proposed PMQSopt algorithm is a sequentially strongly convex programming method that is readily implementable. Numerical experiments illustrate its practical performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the PMQSopt algorithm for weakly convex stochastic programming problems involving expectation-valued functions. It combines the proximal method of multipliers with quadratic approximations and analyzes sample complexity via three KKT metrics: the average squared norm of the gradient of the Moreau envelope of the Lagrangian, average constraint violation, and average complementarity violation. Expected rates of O(T^{-1/4}) are claimed for all three after T iterations, with high-probability bounds of O(T^{-1/8}) for the Lagrangian gradient (prob. 1-1/T^{2/3}) and O(T^{-1/4}) for the violations (probs. 1-2/T^{2/3} and 1-3/T^{2/3}), under weak convexity and existence of a strictly feasible point.
Significance. If the rates hold, the work provides a new, implementable stochastic approximation method for constrained weakly convex problems under mild assumptions, filling a gap between convex SA methods and general nonconvex ones. The dual expected and high-probability guarantees, plus the sequentially strongly convex subproblem structure, add both theoretical and practical value for applications in machine learning and optimization.
major comments (1)
- [Abstract] Abstract: The high-probability bound for the gradient of the Lagrangian is O(T^{-1/8}) with probability 1-1/T^{2/3}, while constraint and complementarity violations achieve the faster O(T^{-1/4}) with probabilities 1-2/T^{2/3} and 1-3/T^{2/3}. This selective rate degradation for the Moreau-envelope gradient metric is not explained. If the variance/martingale assumptions are identical across metrics, the analysis appears to apply a weaker concentration step or different parameter tuning for the gradient bound; the full proof should clarify the source of the looseness or tighten it to O(T^{-1/4}).
minor comments (2)
- [Abstract] The abstract states that numerical experiments illustrate practical performance but provides no details on test problems, dimensions, or baselines; adding a brief description would improve clarity.
- Ensure consistent use of mathcal{O} notation and explicit statement of all probability exponents throughout the manuscript.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback on our manuscript. We address the major comment point by point below.
read point-by-point responses
-
Referee: The high-probability bound for the gradient of the Lagrangian is O(T^{-1/8}) with probability 1-1/T^{2/3}, while constraint and complementarity violations achieve the faster O(T^{-1/4}) with probabilities 1-2/T^{2/3} and 1-3/T^{2/3}. This selective rate degradation for the Moreau-envelope gradient metric is not explained. If the variance/martingale assumptions are identical across metrics, the analysis appears to apply a weaker concentration step or different parameter tuning for the gradient bound; the full proof should clarify the source of the looseness or tighten it to O(T^{-1/4}).
Authors: We agree the abstract and convergence theorem statement do not explain the rate difference. The O(T^{-1/8}) high-probability bound for the Moreau-envelope gradient arises because this metric requires controlling a recursive sequence of quadratic approximation errors inside the proximal multiplier updates; we apply Freedman's martingale inequality to the associated stochastic gradient noise, and optimizing the deviation parameter for failure probability 1/T^{2/3} produces the square-root degradation. Constraint and complementarity violations are direct non-negative averages that concentrate via a standard Hoeffding bound without the recursive layer, preserving O(T^{-1/4}). The differing probabilities (1-1/T^{2/3} vs. 1-2/T^{2/3}, 1-3/T^{2/3}) follow from the number of union-bound events in each proof. Under only weak convexity and strict feasibility we cannot tighten the gradient bound to O(T^{-1/4}) without stronger variance or boundedness assumptions that would restrict applicability. We will revise by adding a clarifying paragraph after Theorem 3.2 and a remark in the appendix proof. revision: yes
Circularity Check
No circularity in derivation; algorithm and rates derived independently from assumptions
full rationale
The paper introduces PMQSopt by combining the proximal method of multipliers with quadratic approximations of the stochastic problem. Expected O(T^{-1/4}) rates for the three KKT metrics (Moreau-envelope gradient norm, constraint violation, complementarity violation) and the high-probability bounds are obtained by direct analysis of the algorithm iterates under the stated assumptions of weak convexity and existence of a strictly feasible point. No self-definitional steps appear, no fitted parameters are relabeled as predictions, and no load-bearing claims reduce to self-citations or prior ansatzes by the same authors. The derivation chain remains self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption weak convexity of all problem functions
- domain assumption existence of a strictly feasible point
Reference graph
Works this paper leans on
-
[1]
Beck,First-Order Methods in Optimization, SIAM, Philadelphia, 2017
A. Beck,First-Order Methods in Optimization, SIAM, Philadelphia, 2017. 37
2017
-
[2]
D. Boob, Q. Deng and G. Lan, Stochastic first-order methods for convex and nonconvex func- tional constrained optimization,Mathematical Programming, 197 (2023), 215–279
2023
-
[3]
K. L. Chung, On a stochastic approximation method,Annals of Mathematical Statistics, 25 (1954), 463–483
1954
-
[4]
Workbench Team C, A Marketing Dataset, 2008.http://www.causality.inf.ethz.ch/ data/CINA.html
2008
-
[5]
F. E. Curtis, M. J. O’Neill and D. P. Robinson, Worst-case complexity of an SQP method for nonlinear equality constrained stochastic optimization,Mathematical Programming, 205 (2024), 431–483
2024
-
[6]
Davis and D
D. Davis and D. Drusvyatskiy, Stochastic model-based minimization of weakly convex func- tions,SIAM Journal on Optimization, 29:1(2019), 207–239
2019
-
[7]
Drusvyatskiy and A
D. Drusvyatskiy and A. S. Lewis, Error bounds, quadratic growth, and linear convergence of proximal methods,Mathematics of Operations Research, 43:3 (2018), 919–948
2018
-
[8]
Ghadimi and G
S. Ghadimi and G. Lan, Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework,SIAM Journal on Op- timization, 22 (2012), 1469–1492
2012
-
[9]
Ghadimi and G
S. Ghadimi and G. Lan, Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms,SIAM Journal on Optimization, 23:4 (2013), 2061–2089
2013
-
[10]
Ghadimi and G
S. Ghadimi and G. Lan, Stochastic first-and zeroth-order methods for nonconvex stochastic programming,SIAM Journal on Optimization, 23:4 (2013), 2341–2368
2013
-
[11]
Ghadimi, G
S. Ghadimi, G. Lan and H. Zhang, Mini-batch stochastic approximation methods for noncon- vex stochastic composite optimization,Mathematical Programming, 155:1 (2016), 267–305
2016
-
[12]
Guyon, S
I. Guyon, S. Gunn, A. Ben-Hur and G. Dror, Result analysis of the NIPS 2003 feature selection challenge,Advances in Neural Information Processing Systems, V ol. 17, MIT Press, 2004, pp. 545–552
2003
-
[13]
Jin and X
L. Jin and X. Wang, A stochastic primal-dual method for a class of nonconvex constrained optimization,Computational Optimization and Applications, 83 (2022), 143–180
2022
-
[14]
Lan,First-Order and Stochastic Optimization Methods for Machine Learning, Springer Series in the Data Sciences, Springer, Cham, Switzerland, 2020
G. Lan,First-Order and Stochastic Optimization Methods for Machine Learning, Springer Series in the Data Sciences, Springer, Cham, Switzerland, 2020
2020
-
[15]
Lan, An optimal method for stochastic composite optimization,Mathematical Program- ming, 133:1 (2012), 365–397
G. Lan, An optimal method for stochastic composite optimization,Mathematical Program- ming, 133:1 (2012), 365–397
2012
-
[16]
G. Lan, A. Nemirovski and A. Shapiro, Validation analysis of mirror descent stochastic ap- proximation method,Mathematical Programming, 134:2 (2012), 425–458
2012
-
[17]
Lan and Z
G. Lan and Z. Zhou, Algorithms for stochastic optimization with expectation constraints,Com- putational Optimization and Applications, 76 (2020), 461–498. 38
2020
-
[18]
LeCun, C
Y . LeCun, C. Cortes and C. J. C. Burges, The MNIST Database of Handwritten Digits, 2010. http://yann.lecun.com/exdb/mnist/
2010
-
[19]
Li, P.-Y
Z. Li, P.-Y . Chen, S. Liu, S. Lu and Y . Xu, Stochastic inexact augmented Lagrangian method for nonconvex expectation constrained optimization,Computational Optimization and Appli- cations, 87 (2024), 117–147
2024
-
[20]
Nemirovski, A
A. Nemirovski, A. Juditsky, G. Lan and A. Shapiro, Robust stochastic approximation approach to stochastic programming,SIAM Journal on Optimization, 19 (2009), 1574–1609
2009
-
[21]
B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging,SIAM Journal on Control and Optimization, 30 (1992), 838–855
1992
-
[22]
Robbins and S
H. Robbins and S. Monro, A stochastic approximation method,Annals of Mathematical Statis- tics, 22 (1951), 400–407
1951
-
[23]
R. T. Rockafellar,Convex Analysis, Princeton University Press, Princeton, 1970
1970
-
[24]
Sacks, Asymptotic distribution of stochastic approximation,Annals of Mathematical Statis- tics, 29 (1958), 373–409
J. Sacks, Asymptotic distribution of stochastic approximation,Annals of Mathematical Statis- tics, 29 (1958), 373–409
1958
-
[25]
Shalev-Shwartz, Online learning and online convex optimization,Foundations and Trends in Machine Learning, 4:2 (2011), 107–194
S. Shalev-Shwartz, Online learning and online convex optimization,Foundations and Trends in Machine Learning, 4:2 (2011), 107–194
2011
-
[26]
Q. Shi, X. Wang and H. Wang, A momentum-based linearized augmented Lagrangian method for nonconvex constrained stochastic optimization,Mathematics of Operations Research, pub- lished online, 23 Jan 2025,https://doi.org/10.1287/moor.2022.0193
-
[27]
Xiao, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research, 11 (2010), 2543–2596
L. Xiao, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research, 11 (2010), 2543–2596
2010
-
[28]
Xu, Primal-dual stochastic gradient method for convex programs with many functional constraints,SIAM Journal on Optimization, 30:2 (2020), 1664–1692
Y . Xu, Primal-dual stochastic gradient method for convex programs with many functional constraints,SIAM Journal on Optimization, 30:2 (2020), 1664–1692
2020
- [29]
-
[30]
Zhang, Y
L. Zhang, Y . Zhang, X. Xiao and J. Wu, Stochastic approximation proximal method of multi- pliers for convex stochastic programming,Mathematics of Operations Research, 48:1 (2023), 177–193. 39
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.