arxiv: 2605.03400 · v1 · submitted 2026-05-05 · 🧮 math.OC

Recognition: unknown

A Quadratic-Approximation-Based Stochastic Approximation Method for Weakly Convex Stochastic Programming

Yule Zhang , Benqi Liu , Xiantao Xiao , Liwei Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-07 15:49 UTC · model grok-4.3

classification 🧮 math.OC

keywords stochastic approximationweakly convex optimizationproximal method of multipliersquadratic approximationKKT conditionsconvergence ratesstochastic programmingsample complexity

0 comments

The pith

PMQSopt integrates proximal multipliers with quadratic approximations to reach O(T^{-1/4}) expected convergence on KKT metrics for weakly convex stochastic programs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PMQSopt, a stochastic approximation algorithm that combines the proximal method of multipliers with quadratic approximations of the original stochastic problem to solve weakly convex stochastic optimization problems involving expectation-valued functions. It proves that after T iterations the expected squared norm of the gradient of the Moreau envelope of the Lagrangian, the expected constraint violation, and the expected complementarity violation each converge at rate O(T^{-1/4}), with accompanying high-probability bounds that are slightly weaker. These guarantees hold under the two stated conditions of weak convexity of the problem functions and existence of a strictly feasible point, and the method is shown to be sequentially strongly convex and directly implementable.

Core claim

PMQSopt is constructed by integrating the proximal method of multipliers with quadratic approximations of the original stochastic problem. The convergence of the algorithm is characterized by three metrics associated with the epsilon-KKT conditions: the average squared norm of the gradient of the Moreau envelope of the Lagrangian, the average constraint violation, and the average complementarity violation. For each of these metrics the paper establishes an expected convergence rate of O(T^{-1/4}) after T iterations, together with explicit high-probability bounds of O(T^{-1/8}) or O(T^{-1/4}) that hold with probability at least 1 minus a small power of 1/T.

What carries the argument

The proximal method of multipliers combined with quadratic approximations of the stochastic objective and constraints, which produces a sequence of strongly convex subproblems whose solutions are used to update the multipliers and the quadratic model.

If this is right

The total number of stochastic gradient evaluations required to reach a given accuracy is characterized directly by the iteration count T.
The method produces iterates that satisfy high-probability bounds on the Lagrangian gradient, constraint violation, and complementarity violation simultaneously.
The algorithm remains applicable to constrained stochastic programs whose functions satisfy only weak convexity rather than strong convexity or smoothness.
Numerical performance is illustrated on test problems, confirming that the method can be implemented and run without additional tuning beyond the stated assumptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The quadratic-approximation step may allow reuse of existing convex solvers for the subproblems, lowering the barrier to practical deployment in large-scale settings.
The high-probability bounds suggest the method could be paired with variance-reduction or mini-batch techniques to tighten the probabilistic guarantees without changing the core iteration.
Because the analysis separates the handling of the Lagrangian gradient from the feasibility metrics, the same framework could be adapted to other penalty or augmented-Lagrangian schemes for nonconvex stochastic programs.

Load-bearing premise

All convergence claims rest on the assumption that every problem function is weakly convex and that a strictly feasible point exists.

What would settle it

Construct a weakly convex stochastic program with a strictly feasible point, run PMQSopt for large T, and observe that any of the three KKT metrics fails to improve at rate O(T^{-1/4}) in expectation.

Figures

Figures reproduced from arXiv: 2605.03400 by Benqi Liu, Liwei Zhang, Xiantao Xiao, Yule Zhang.

**Figure 1.** Figure 1: Objective value and constraint violation versus the number of stochastic gradients on the view at source ↗

**Figure 2.** Figure 2: Empirical decay of the theory-driven residuals for PMQSopt on the synthetic QCNP. view at source ↗

**Figure 3.** Figure 3: Objective value and constraint violation versus the number of stochastic gradients on view at source ↗

**Figure 4.** Figure 4: Objective value and constraint violation versus the number of stochastic gradients on view at source ↗

**Figure 5.** Figure 5: Objective value and constraint violation versus the number of stochastic gradients on view at source ↗

**Figure 6.** Figure 6: Objective value and fairness constraint violation versus the number of stochastic gradients view at source ↗

**Figure 7.** Figure 7: Objective value and fairness constraint violation versus the number of stochastic gradients view at source ↗

**Figure 8.** Figure 8: Objective value and fairness constraint violation versus the number of stochastic gradients view at source ↗

read the original abstract

We propose a novel stochastic approximation algorithm, termed PMQSopt, for solving weakly convex stochastic optimization problems involving expectation-valued functions. The algorithm is constructed by integrating the proximal method of multipliers with quadratic approximations of the original stochastic problem. We analyze the sample complexity of PMQSopt in terms of the total number of stochastic gradient evaluations required. The convergence of the algorithm is characterized by three metrics associated with the $\epsilon$-KKT conditions: the average squared norm of the gradient of the Moreau envelope of the Lagrangian, the average constraint violation, and the average complementarity violation. For each of these metrics, we establish an expected convergence rate of $\mathcal{O}(T^{-1/4})$ after $T$ iterations. Furthermore, we show that with probability at least $1-1/T^{2/3}$, the gradient of the Lagrangian satisfies an $\mathcal{O}(T^{-1/8})$ bound; with probability at least $1-2/T^{2/3}$, the constraint violation achieves an $\mathcal{O}(T^{-1/4})$ bound; and with probability at least $1-3/T^{2/3}$, the complementarity violation attains an $\mathcal{O}(T^{-1/4})$ bound. All results are established under two mild conditions: (i) weak convexity of all problem functions, and (ii) the existence of a strictly feasible point. The proposed PMQSopt algorithm is a sequentially strongly convex programming method that is readily implementable. Numerical experiments illustrate its practical performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PMQSopt gives a workable new template for weakly convex constrained stochastic programs with O(T^{-1/4}) expected rates, but the high-probability bound on the Lagrangian gradient is noticeably weaker than the violation bounds.

read the letter

The paper introduces PMQSopt, which folds quadratic approximations into a proximal method of multipliers for expectation-valued weakly convex problems. It reports expected O(T^{-1/4}) rates on the squared Moreau-envelope gradient norm, constraint violation, and complementarity violation after T steps, plus high-probability versions of those bounds under weak convexity and strict feasibility. The method is presented as sequentially strongly convex and easy to code, with some numerical checks included. That combination of proximal multipliers and quadratic steps inside a stochastic approximation loop for this regime looks like the actual new piece; it does not collapse to prior results on the abstract description alone. The implementability claim and the experiments are the parts that feel most immediately useful. The split in the high-probability statements stands out as the clearest soft spot. The Lagrangian gradient only reaches O(T^{-1/8}) with probability 1-1/T^{2/3}, while the two violation metrics keep the faster O(T^{-1/4}) rate at probabilities 1-2/T^{2/3} and 1-3/T^{2/3}. If the noise and martingale assumptions are comparable across the three metrics, this difference hints that the concentration argument for the Moreau-envelope term is looser or uses different parameter choices. The derivations themselves are not visible here, so it is hard to tell whether the gap is an artifact of the proof technique or something more fundamental. The two standing assumptions are mild and standard, which helps generality but also means the rates rest on those conditions holding cleanly. This is the kind of paper that readers working on constrained stochastic optimization or weakly convex models in machine learning would want to see. It supplies a concrete algorithmic option with explicit sample-complexity statements rather than another abstract existence result. I would send it to peer review. The core template is distinct enough that referees can check the tightness of the high-probability analysis and see whether the rates can be aligned or improved without extra assumptions.

Referee Report

1 major / 2 minor

Summary. The paper proposes the PMQSopt algorithm for weakly convex stochastic programming problems involving expectation-valued functions. It combines the proximal method of multipliers with quadratic approximations and analyzes sample complexity via three KKT metrics: the average squared norm of the gradient of the Moreau envelope of the Lagrangian, average constraint violation, and average complementarity violation. Expected rates of O(T^{-1/4}) are claimed for all three after T iterations, with high-probability bounds of O(T^{-1/8}) for the Lagrangian gradient (prob. 1-1/T^{2/3}) and O(T^{-1/4}) for the violations (probs. 1-2/T^{2/3} and 1-3/T^{2/3}), under weak convexity and existence of a strictly feasible point.

Significance. If the rates hold, the work provides a new, implementable stochastic approximation method for constrained weakly convex problems under mild assumptions, filling a gap between convex SA methods and general nonconvex ones. The dual expected and high-probability guarantees, plus the sequentially strongly convex subproblem structure, add both theoretical and practical value for applications in machine learning and optimization.

major comments (1)

[Abstract] Abstract: The high-probability bound for the gradient of the Lagrangian is O(T^{-1/8}) with probability 1-1/T^{2/3}, while constraint and complementarity violations achieve the faster O(T^{-1/4}) with probabilities 1-2/T^{2/3} and 1-3/T^{2/3}. This selective rate degradation for the Moreau-envelope gradient metric is not explained. If the variance/martingale assumptions are identical across metrics, the analysis appears to apply a weaker concentration step or different parameter tuning for the gradient bound; the full proof should clarify the source of the looseness or tighten it to O(T^{-1/4}).

minor comments (2)

[Abstract] The abstract states that numerical experiments illustrate practical performance but provides no details on test problems, dimensions, or baselines; adding a brief description would improve clarity.
Ensure consistent use of mathcal{O} notation and explicit statement of all probability exponents throughout the manuscript.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. We address the major comment point by point below.

read point-by-point responses

Referee: The high-probability bound for the gradient of the Lagrangian is O(T^{-1/8}) with probability 1-1/T^{2/3}, while constraint and complementarity violations achieve the faster O(T^{-1/4}) with probabilities 1-2/T^{2/3} and 1-3/T^{2/3}. This selective rate degradation for the Moreau-envelope gradient metric is not explained. If the variance/martingale assumptions are identical across metrics, the analysis appears to apply a weaker concentration step or different parameter tuning for the gradient bound; the full proof should clarify the source of the looseness or tighten it to O(T^{-1/4}).

Authors: We agree the abstract and convergence theorem statement do not explain the rate difference. The O(T^{-1/8}) high-probability bound for the Moreau-envelope gradient arises because this metric requires controlling a recursive sequence of quadratic approximation errors inside the proximal multiplier updates; we apply Freedman's martingale inequality to the associated stochastic gradient noise, and optimizing the deviation parameter for failure probability 1/T^{2/3} produces the square-root degradation. Constraint and complementarity violations are direct non-negative averages that concentrate via a standard Hoeffding bound without the recursive layer, preserving O(T^{-1/4}). The differing probabilities (1-1/T^{2/3} vs. 1-2/T^{2/3}, 1-3/T^{2/3}) follow from the number of union-bound events in each proof. Under only weak convexity and strict feasibility we cannot tighten the gradient bound to O(T^{-1/4}) without stronger variance or boundedness assumptions that would restrict applicability. We will revise by adding a clarifying paragraph after Theorem 3.2 and a remark in the appendix proof. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation; algorithm and rates derived independently from assumptions

full rationale

The paper introduces PMQSopt by combining the proximal method of multipliers with quadratic approximations of the stochastic problem. Expected O(T^{-1/4}) rates for the three KKT metrics (Moreau-envelope gradient norm, constraint violation, complementarity violation) and the high-probability bounds are obtained by direct analysis of the algorithm iterates under the stated assumptions of weak convexity and existence of a strictly feasible point. No self-definitional steps appear, no fitted parameters are relabeled as predictions, and no load-bearing claims reduce to self-citations or prior ansatzes by the same authors. The derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The convergence claims rest on two domain assumptions that are not derived inside the paper.

axioms (2)

domain assumption weak convexity of all problem functions
Invoked as condition (i) to obtain the stated rates
domain assumption existence of a strictly feasible point
Invoked as condition (ii) to obtain the stated rates

pith-pipeline@v0.9.0 · 5579 in / 1296 out tokens · 76480 ms · 2026-05-07T15:49:38.419946+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 2 canonical work pages

[1]

Beck,First-Order Methods in Optimization, SIAM, Philadelphia, 2017

A. Beck,First-Order Methods in Optimization, SIAM, Philadelphia, 2017. 37

2017
[2]

D. Boob, Q. Deng and G. Lan, Stochastic first-order methods for convex and nonconvex func- tional constrained optimization,Mathematical Programming, 197 (2023), 215–279

2023
[3]

K. L. Chung, On a stochastic approximation method,Annals of Mathematical Statistics, 25 (1954), 463–483

1954
[4]

Workbench Team C, A Marketing Dataset, 2008.http://www.causality.inf.ethz.ch/ data/CINA.html

2008
[5]

F. E. Curtis, M. J. O’Neill and D. P. Robinson, Worst-case complexity of an SQP method for nonlinear equality constrained stochastic optimization,Mathematical Programming, 205 (2024), 431–483

2024
[6]

Davis and D

D. Davis and D. Drusvyatskiy, Stochastic model-based minimization of weakly convex func- tions,SIAM Journal on Optimization, 29:1(2019), 207–239

2019
[7]

Drusvyatskiy and A

D. Drusvyatskiy and A. S. Lewis, Error bounds, quadratic growth, and linear convergence of proximal methods,Mathematics of Operations Research, 43:3 (2018), 919–948

2018
[8]

Ghadimi and G

S. Ghadimi and G. Lan, Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework,SIAM Journal on Op- timization, 22 (2012), 1469–1492

2012
[9]

Ghadimi and G

S. Ghadimi and G. Lan, Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms,SIAM Journal on Optimization, 23:4 (2013), 2061–2089

2013
[10]

Ghadimi and G

S. Ghadimi and G. Lan, Stochastic first-and zeroth-order methods for nonconvex stochastic programming,SIAM Journal on Optimization, 23:4 (2013), 2341–2368

2013
[11]

Ghadimi, G

S. Ghadimi, G. Lan and H. Zhang, Mini-batch stochastic approximation methods for noncon- vex stochastic composite optimization,Mathematical Programming, 155:1 (2016), 267–305

2016
[12]

Guyon, S

I. Guyon, S. Gunn, A. Ben-Hur and G. Dror, Result analysis of the NIPS 2003 feature selection challenge,Advances in Neural Information Processing Systems, V ol. 17, MIT Press, 2004, pp. 545–552

2003
[13]

Jin and X

L. Jin and X. Wang, A stochastic primal-dual method for a class of nonconvex constrained optimization,Computational Optimization and Applications, 83 (2022), 143–180

2022
[14]

Lan,First-Order and Stochastic Optimization Methods for Machine Learning, Springer Series in the Data Sciences, Springer, Cham, Switzerland, 2020

G. Lan,First-Order and Stochastic Optimization Methods for Machine Learning, Springer Series in the Data Sciences, Springer, Cham, Switzerland, 2020

2020
[15]

Lan, An optimal method for stochastic composite optimization,Mathematical Program- ming, 133:1 (2012), 365–397

G. Lan, An optimal method for stochastic composite optimization,Mathematical Program- ming, 133:1 (2012), 365–397

2012
[16]

G. Lan, A. Nemirovski and A. Shapiro, Validation analysis of mirror descent stochastic ap- proximation method,Mathematical Programming, 134:2 (2012), 425–458

2012
[17]

Lan and Z

G. Lan and Z. Zhou, Algorithms for stochastic optimization with expectation constraints,Com- putational Optimization and Applications, 76 (2020), 461–498. 38

2020
[18]

LeCun, C

Y . LeCun, C. Cortes and C. J. C. Burges, The MNIST Database of Handwritten Digits, 2010. http://yann.lecun.com/exdb/mnist/

2010
[19]

Li, P.-Y

Z. Li, P.-Y . Chen, S. Liu, S. Lu and Y . Xu, Stochastic inexact augmented Lagrangian method for nonconvex expectation constrained optimization,Computational Optimization and Appli- cations, 87 (2024), 117–147

2024
[20]

Nemirovski, A

A. Nemirovski, A. Juditsky, G. Lan and A. Shapiro, Robust stochastic approximation approach to stochastic programming,SIAM Journal on Optimization, 19 (2009), 1574–1609

2009
[21]

B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging,SIAM Journal on Control and Optimization, 30 (1992), 838–855

1992
[22]

Robbins and S

H. Robbins and S. Monro, A stochastic approximation method,Annals of Mathematical Statis- tics, 22 (1951), 400–407

1951
[23]

R. T. Rockafellar,Convex Analysis, Princeton University Press, Princeton, 1970

1970
[24]

Sacks, Asymptotic distribution of stochastic approximation,Annals of Mathematical Statis- tics, 29 (1958), 373–409

J. Sacks, Asymptotic distribution of stochastic approximation,Annals of Mathematical Statis- tics, 29 (1958), 373–409

1958
[25]

Shalev-Shwartz, Online learning and online convex optimization,Foundations and Trends in Machine Learning, 4:2 (2011), 107–194

S. Shalev-Shwartz, Online learning and online convex optimization,Foundations and Trends in Machine Learning, 4:2 (2011), 107–194

2011
[26]

Q. Shi, X. Wang and H. Wang, A momentum-based linearized augmented Lagrangian method for nonconvex constrained stochastic optimization,Mathematics of Operations Research, pub- lished online, 23 Jan 2025,https://doi.org/10.1287/moor.2022.0193

work page doi:10.1287/moor.2022.0193 2025
[27]

Xiao, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research, 11 (2010), 2543–2596

L. Xiao, Dual averaging methods for regularized stochastic learning and online optimization, Journal of Machine Learning Research, 11 (2010), 2543–2596

2010
[28]

Xu, Primal-dual stochastic gradient method for convex programs with many functional constraints,SIAM Journal on Optimization, 30:2 (2020), 1664–1692

Y . Xu, Primal-dual stochastic gradient method for convex programs with many functional constraints,SIAM Journal on Optimization, 30:2 (2020), 1664–1692

2020
[29]

H. Yu, M. J. Neely and X. Wei, Online convex optimization with stochastic constraints, arXiv:1708.03741v1, 2017

work page arXiv 2017
[30]

Zhang, Y

L. Zhang, Y . Zhang, X. Xiao and J. Wu, Stochastic approximation proximal method of multi- pliers for convex stochastic programming,Mathematics of Operations Research, 48:1 (2023), 177–193. 39

2023