A Quadratic-Approximation-Based Stochastic Approximation Method for Weakly Convex Stochastic Programming
Pith reviewed 2026-05-07 15:49 UTC · model grok-4.3
The pith
PMQSopt integrates proximal multipliers with quadratic approximations to reach O(T^{-1/4}) expected convergence on KKT metrics for weakly convex stochastic programs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PMQSopt is constructed by integrating the proximal method of multipliers with quadratic approximations of the original stochastic problem. The convergence of the algorithm is characterized by three metrics associated with the epsilon-KKT conditions: the average squared norm of the gradient of the Moreau envelope of the Lagrangian, the average constraint violation, and the average complementarity violation. For each of these metrics the paper establishes an expected convergence rate of O(T^{-1/4}) after T iterations, together with explicit high-probability bounds of O(T^{-1/8}) or O(T^{-1/4}) that hold with probability at least 1 minus a small power of 1/T.
What carries the argument
The proximal method of multipliers combined with quadratic approximations of the stochastic objective and constraints, which produces a sequence of strongly convex subproblems whose solutions are used to update the multipliers and the quadratic model.
If this is right
- The total number of stochastic gradient evaluations required to reach a given accuracy is characterized directly by the iteration count T.
- The method produces iterates that satisfy high-probability bounds on the Lagrangian gradient, constraint violation, and complementarity violation simultaneously.
- The algorithm remains applicable to constrained stochastic programs whose functions satisfy only weak convexity rather than strong convexity or smoothness.
- Numerical performance is illustrated on test problems, confirming that the method can be implemented and run without additional tuning beyond the stated assumptions.
Where Pith is reading between the lines
- The quadratic-approximation step may allow reuse of existing convex solvers for the subproblems, lowering the barrier to practical deployment in large-scale settings.
- The high-probability bounds suggest the method could be paired with variance-reduction or mini-batch techniques to tighten the probabilistic guarantees without changing the core iteration.
- Because the analysis separates the handling of the Lagrangian gradient from the feasibility metrics, the same framework could be adapted to other penalty or augmented-Lagrangian schemes for nonconvex stochastic programs.
Load-bearing premise
All convergence claims rest on the assumption that every problem function is weakly convex and that a strictly feasible point exists.
What would settle it
Construct a weakly convex stochastic program with a strictly feasible point, run PMQSopt for large T, and observe that any of the three KKT metrics fails to improve at rate O(T^{-1/4}) in expectation.
Figures
read the original abstract
We propose a novel stochastic approximation algorithm, termed PMQSopt, for solving weakly convex stochastic optimization problems involving expectation-valued functions. The algorithm is constructed by integrating the proximal method of multipliers with quadratic approximations of the original stochastic problem. We analyze the sample complexity of PMQSopt in terms of the total number of stochastic gradient evaluations required. The convergence of the algorithm is characterized by three metrics associated with the $\epsilon$-KKT conditions: the average squared norm of the gradient of the Moreau envelope of the Lagrangian, the average constraint violation, and the average complementarity violation. For each of these metrics, we establish an expected convergence rate of $\mathcal{O}(T^{-1/4})$ after $T$ iterations. Furthermore, we show that with probability at least $1-1/T^{2/3}$, the gradient of the Lagrangian satisfies an $\mathcal{O}(T^{-1/8})$ bound; with probability at least $1-2/T^{2/3}$, the constraint violation achieves an $\mathcal{O}(T^{-1/4})$ bound; and with probability at least $1-3/T^{2/3}$, the complementarity violation attains an $\mathcal{O}(T^{-1/4})$ bound. All results are established under two mild conditions: (i) weak convexity of all problem functions, and (ii) the existence of a strictly feasible point. The proposed PMQSopt algorithm is a sequentially strongly convex programming method that is readily implementable. Numerical experiments illustrate its practical performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the PMQSopt algorithm for weakly convex stochastic programming problems involving expectation-valued functions. It combines the proximal method of multipliers with quadratic approximations and analyzes sample complexity via three KKT metrics: the average squared norm of the gradient of the Moreau envelope of the Lagrangian, average constraint violation, and average complementarity violation. Expected rates of O(T^{-1/4}) are claimed for all three after T iterations, with high-probability bounds of O(T^{-1/8}) for the Lagrangian gradient (prob. 1-1/T^{2/3}) and O(T^{-1/4}) for the violations (probs. 1-2/T^{2/3} and 1-3/T^{2/3}), under weak convexity and existence of a strictly feasible point.
Significance. If the rates hold, the work provides a new, implementable stochastic approximation method for constrained weakly convex problems under mild assumptions, filling a gap between convex SA methods and general nonconvex ones. The dual expected and high-probability guarantees, plus the sequentially strongly convex subproblem structure, add both theoretical and practical value for applications in machine learning and optimization.
major comments (1)
- [Abstract] Abstract: The high-probability bound for the gradient of the Lagrangian is O(T^{-1/8}) with probability 1-1/T^{2/3}, while constraint and complementarity violations achieve the faster O(T^{-1/4}) with probabilities 1-2/T^{2/3} and 1-3/T^{2/3}. This selective rate degradation for the Moreau-envelope gradient metric is not explained. If the variance/martingale assumptions are identical across metrics, the analysis appears to apply a weaker concentration step or different parameter tuning for the gradient bound; the full proof should clarify the source of the looseness or tighten it to O(T^{-1/4}).
minor comments (2)
- [Abstract] The abstract states that numerical experiments illustrate practical performance but provides no details on test problems, dimensions, or baselines; adding a brief description would improve clarity.
- Ensure consistent use of mathcal{O} notation and explicit statement of all probability exponents throughout the manuscript.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback on our manuscript. We address the major comment point by point below.
read point-by-point responses
-
Referee: The high-probability bound for the gradient of the Lagrangian is O(T^{-1/8}) with probability 1-1/T^{2/3}, while constraint and complementarity violations achieve the faster O(T^{-1/4}) with probabilities 1-2/T^{2/3} and 1-3/T^{2/3}. This selective rate degradation for the Moreau-envelope gradient metric is not explained. If the variance/martingale assumptions are identical across metrics, the analysis appears to apply a weaker concentration step or different parameter tuning for the gradient bound; the full proof should clarify the source of the looseness or tighten it to O(T^{-1/4}).
Authors: We agree the abstract and convergence theorem statement do not explain the rate difference. The O(T^{-1/8}) high-probability bound for the Moreau-envelope gradient arises because this metric requires controlling a recursive sequence of quadratic approximation errors inside the proximal multiplier updates; we apply Freedman's martingale inequality to the associated stochastic gradient noise, and optimizing the deviation parameter for failure probability 1/T^{2/3} produces the square-root degradation. Constraint and complementarity violations are direct non-negative averages that concentrate via a standard Hoeffding bound without the recursive layer, preserving O(T^{-1/4}). The differing probabilities (1-1/T^{2/3} vs. 1-2/T^{2/3}, 1-3/T^{2/3}) follow from the number of union-bound events in each proof. Under only weak convexity and strict feasibility we cannot tighten the gradient bound to O(T^{-1/4}) without stronger variance or boundedness assumptions that would restrict applicability. We will revise by adding a clarifying paragraph after Theorem 3.2 and a remark in the appendix proof. revision: yes
Circularity Check
No circularity in derivation; algorithm and rates derived independently from assumptions
full rationale
The paper introduces PMQSopt by combining the proximal method of multipliers with quadratic approximations of the stochastic problem. Expected O(T^{-1/4}) rates for the three KKT metrics (Moreau-envelope gradient norm, constraint violation, complementarity violation) and the high-probability bounds are obtained by direct analysis of the algorithm iterates under the stated assumptions of weak convexity and existence of a strictly feasible point. No self-definitional steps appear, no fitted parameters are relabeled as predictions, and no load-bearing claims reduce to self-citations or prior ansatzes by the same authors. The derivation chain remains self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption weak convexity of all problem functions
- domain assumption existence of a strictly feasible point
Forward citations
Cited by 1 Pith paper
-
A Fletcher's Augmented Lagrangian-Based Stochastic First-Order Method for Nonconvex Equality-Constrained Optimization
New stochastic method based on decomposed search directions and Fletcher's augmented Lagrangian achieves O(ε^{-3}) expected oracle complexity for nonconvex equality-constrained optimization.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.