pith. machine review for the scientific record.

arxiv: 2605.11627 · v1 · submitted 2026-05-12 · 🧮 math.OC

Recognition: 2 theorem links


Proximal Limited-Memory Quasi-Newton Methods for Nonsmooth Nonconvex Optimization

Alberto De Marchi, Christian Kanzow, Simeon vom Dahl

Pith reviewed 2026-05-13 01:25 UTC · model grok-4.3

classification 🧮 math.OC
keywords proximal quasi-Newton · limited-memory methods · nonsmooth nonconvex optimization · global convergence · Kurdyka-Łojasiewicz property · proximal operators · compact representation

The pith

A proximal limited-memory quasi-Newton scheme converges globally for nonsmooth nonconvex problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an algorithm that minimizes the sum of a continuously differentiable function and a proper, lower semicontinuous, prox-bounded, possibly nonsmooth function, where both may be nonconvex. It combines proximal operators with limited-memory quasi-Newton updates and adapts a regularization parameter to enforce sufficient decrease at each step. Global convergence to stationary points is shown under mild conditions on the two functions, followed by convergence of the full sequence with rates when the objective satisfies the Kurdyka-Łojasiewicz property. The subproblems are solved efficiently by using compact representations of the limited-memory updates, and numerical tests indicate practical speedups over existing approaches.
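The adaptive-regularization loop described above can be sketched in a few lines. This is an illustration, not the paper's algorithm: it takes a plain gradient step in place of the limited-memory quasi-Newton metric, specializes to the ℓ1-regularized least-squares objective, and the names (`prox_descent`, `rho`, `sigma0`) are ours.

```python
import numpy as np

def soft_threshold(v, t):
    # Closed-form proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_descent(A, b, lam, x0, sigma0=1.0, rho=0.1, max_iter=2000, tol=1e-9):
    # Minimize 0.5*||Ax - b||^2 + lam*||x||_1 with a proximal step whose
    # regularization parameter sigma is raised until a sufficient-decrease
    # test holds, then relaxed again after each accepted step.
    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
    F = lambda x: f(x) + lam * np.sum(np.abs(x))
    x, sigma = x0.astype(float).copy(), sigma0
    for _ in range(max_iter):
        g = A.T @ (A @ x - b)
        while True:
            x_new = soft_threshold(x - g / sigma, lam / sigma)
            step = x_new - x
            # Sufficient decrease in the composite objective F.
            if F(x_new) <= F(x) - 0.5 * rho * sigma * (step @ step):
                break
            sigma *= 2.0  # step rejected: tighten the regularization
        x = x_new
        if np.sqrt(step @ step) < tol:
            break
        sigma = max(sigma / 2.0, 1e-12)  # step accepted: relax again
    return x
```

The inner loop mirrors the sufficient-decrease globalization: no global Lipschitz constant is needed, since sigma is doubled only when the decrease test fails.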

Core claim

We introduce a proximal limited-memory quasi-Newton scheme for minimizing the sum of a continuously differentiable function and a proper, lower semicontinuous and prox-bounded, possibly nonsmooth, function. Both functions might be nonconvex. The method builds upon the computation of scaled proximal operators and is globalized by adaptively updating a regularization parameter based on a criterion of sufficient decrease. We prove global convergence under mild assumptions and then establish convergence of the entire sequence (with rates) under the Kurdyka-Łojasiewicz property. To efficiently solve the subproblems, we exploit the compact representation of limited-memory quasi-Newton updates. We also derive a compact representation of the limited-memory Kleinmichel formula, a rank-one quasi-Newton scheme that preserves positive definiteness under the same condition as the BFGS update.

What carries the argument

The proximal limited-memory quasi-Newton update that uses a compact representation to approximate the inverse Hessian of the smooth part and compute the scaled proximal operator at each iteration.

If this is right

  • Global convergence to stationary points holds under the stated mild assumptions on the smooth and nonsmooth parts.
  • Convergence of the entire sequence together with explicit rates follows when the objective satisfies the Kurdyka-Lojasiewicz property.
  • Subproblems remain tractable because the limited-memory updates admit a compact representation that avoids storing full matrices.
  • A rank-one limited-memory Kleinmichel formula preserves positive definiteness under the same condition required for BFGS.
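The compact-representation claim in the bullets above can be checked numerically. Below is a small sketch of the classical Byrd-Nocedal-Schnabel compact form for BFGS (reference [11] in the paper's bibliography); the paper's Kleinmichel variant is analogous but not reproduced here. Function names are ours.

```python
import numpy as np

def bfgs_recursive(S, Y, gamma):
    # Reference implementation: apply m dense BFGS updates to B0 = gamma*I,
    # one secant pair (S[:, j], Y[:, j]) at a time.
    n, m = S.shape
    B = gamma * np.eye(n)
    for j in range(m):
        s, y = S[:, j], Y[:, j]
        Bs = B @ s
        B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
    return B

def bfgs_compact(S, Y, gamma):
    # Byrd-Nocedal-Schnabel compact form of the same m updates:
    #   B = gamma*I - [gamma*S  Y] M^{-1} [gamma*S  Y]^T,
    #   M = [[gamma*S^T S, L], [L^T, -D]],
    # with L the strictly lower triangle of S^T Y and D its diagonal.
    # Only an n x 2m factor and a 2m x 2m middle matrix are stored,
    # which is what avoids forming full n x n matrices.
    n, m = S.shape
    StY = S.T @ Y
    L = np.tril(StY, k=-1)
    D = np.diag(np.diag(StY))
    Psi = np.hstack([gamma * S, Y])
    M = np.block([[gamma * (S.T @ S), L], [L.T, -D]])
    return gamma * np.eye(n) - Psi @ np.linalg.solve(M, Psi.T)
```

Both routines produce the same matrix whenever every pair has positive curvature (s_j^T y_j > 0), the same condition under which BFGS, and per the paper the Kleinmichel update, preserves positive definiteness.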

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The compact representation may extend the method to problem dimensions where full-matrix storage is impossible.
  • The observed numerical speedups suggest the scheme could replace slower first-order proximal methods on large-scale instances.
  • Similar limited-memory proximal ideas might improve other nonconvex algorithms that rely on proximal mappings.

Load-bearing premise

The nonsmooth function is proper, lower semicontinuous and prox-bounded, the smooth part is continuously differentiable, and the adaptive regularization guarantees sufficient decrease at every step.
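Prox-boundedness is what keeps the scaled proximal subproblems well-posed even though the nonsmooth part may be nonconvex. A brute-force sketch with the capped-ℓ1 regularizer used in the paper's experiments (grid search stands in for the closed-form prox; parameter names and defaults are ours):

```python
import numpy as np

def prox_capped_l1(v, lam=1.0, theta=1.0, sigma=1.0):
    # Brute-force prox: argmin_x lam*min(|x|, theta) + (sigma/2)*(x - v)^2.
    # Capped-l1 is nonconvex but bounded below, hence prox-bounded, so the
    # subproblem has a minimizer for every sigma > 0.
    xs = np.linspace(-8.0, 8.0, 16001)
    vals = lam * np.minimum(np.abs(xs), theta) + 0.5 * sigma * (xs - v) ** 2
    return xs[np.argmin(vals)]
```

Far from the origin the cap makes the penalty flat, so the prox leaves the point essentially unchanged; near the origin it behaves like soft thresholding. Both regimes fall out of the grid search without any convexity assumption.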

What would settle it

A concrete example or counterexample in which the iterates satisfy the sufficient-decrease condition at every iteration yet do not approach a stationary point, or in which the full sequence fails to converge when the objective satisfies the Kurdyka-Lojasiewicz property.

Figures

Figures reproduced from arXiv: 2605.11627 by Alberto De Marchi, Christian Kanzow, Simeon vom Dahl.

Figure 1. Comparison on regularized logistic regression problems, in terms of relative runtimes. Accuracy level: low (left) and high (right). Regularization: ℓ1 (top) and capped-ℓ1 (bottom).

Figure 2. Comparison of solver variants with monotone and nonmonotone globalization, in terms of relative runtimes. Results aggregated for all logistic regression problems. Accuracy level: low (left) and high (right). Sample size: 120 problems for each accuracy level.

Figure 3. Comparison on regularized Student's t-regression problems, in terms of relative runtimes. Accuracy level: low (left) and high (right). Regularization: ℓ1 (top) and capped-ℓ1 (bottom).

Figure 4. Comparison on discretized obstacle problems, in terms of relative runtimes. Accuracy level: low (left) and high (right).
read the original abstract

We introduce a proximal limited-memory quasi-Newton scheme for minimizing the sum of a continuously differentiable function and a proper, lower semicontinuous and prox-bounded, possibly nonsmooth, function. Both functions might be nonconvex. The method builds upon the computation of scaled proximal operators and is globalized by adaptively updating a regularization parameter based on a criterion of sufficient decrease. We prove global convergence under mild assumptions and then establish convergence of the entire sequence (with rates) under the Kurdyka-Łojasiewicz property. To efficiently solve the subproblems, we exploit the compact representation of limited-memory quasi-Newton updates. We also derive a compact representation of the limited-memory Kleinmichel formula, a rank-one quasi-Newton scheme that preserves positive definiteness under the same condition as the BFGS update. Numerical results show a significant speed-up compared to other methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a proximal limited-memory quasi-Newton algorithm for minimizing f(x) + g(x) where f is continuously differentiable (possibly nonconvex) and g is proper, lower semicontinuous, and prox-bounded (possibly nonsmooth and nonconvex). The method employs scaled proximal operators with limited-memory quasi-Newton updates, globalized by an adaptive regularization parameter enforcing sufficient decrease. It claims global convergence to stationary points under mild assumptions, followed by convergence of the full sequence (with rates) under the Kurdyka-Łojasiewicz property. Compact representations of the limited-memory updates are derived, including a new compact form of the Kleinmichel formula that preserves positive definiteness under the same conditions as BFGS. Numerical experiments report significant speedups relative to other methods.

Significance. If the convergence results hold, the work supplies a practical, theoretically supported extension of limited-memory quasi-Newton methods to nonsmooth nonconvex problems, addressing a gap where such approximations are often needed for scalability. The compact representations and Kleinmichel variant are concrete contributions that could aid implementation, and the reported numerical gains indicate potential practical impact in large-scale optimization.

major comments (2)
  1. [convergence analysis] The global convergence argument (abstract and convergence section) rests on the adaptive regularization ensuring sufficient decrease at each step while using the limited-memory quasi-Newton direction; however, the manuscript must explicitly verify that the compact representation preserves the descent property required for the decrease condition, as the curvature approximation could otherwise violate the assumptions used in the proximal-operator analysis.
  2. [KL convergence theorem] The claim of full-sequence convergence with rates under the KL property (abstract) depends on the subproblem solver and the properties of the scaled proximal operator; the interaction between the limited-memory update and the KL inequality needs a self-contained argument showing that the sequence remains in the region where the KL exponent applies, without additional hidden restrictions on the prox-boundedness constant.
minor comments (2)
  1. [abstract] The abstract states that both functions 'might be nonconvex' but the assumptions list only C1 for the smooth part; clarify whether the nonconvexity of f is fully covered by the stated hypotheses or requires an additional local Lipschitz gradient assumption.
  2. [algorithm description] Notation for the limited-memory matrices (e.g., the compact form of the Kleinmichel update) should be introduced with an explicit reference to the standard L-BFGS two-loop recursion to aid readers unfamiliar with the rank-one variant.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading of our manuscript and the constructive comments. We are pleased that the referee recognizes the significance of extending limited-memory quasi-Newton methods to nonsmooth nonconvex problems. We address the major comments below and will revise the manuscript accordingly to incorporate the suggested clarifications.

read point-by-point responses
  1. Referee: [convergence analysis] The global convergence argument (abstract and convergence section) rests on the adaptive regularization ensuring sufficient decrease at each step while using the limited-memory quasi-Newton direction; however, the manuscript must explicitly verify that the compact representation preserves the descent property required for the decrease condition, as the curvature approximation could otherwise violate the assumptions used in the proximal-operator analysis.

    Authors: We agree that an explicit verification is beneficial for clarity. In the revised manuscript, we will add a short lemma (or remark) in the global convergence section demonstrating that the compact representations of the limited-memory updates (including the new Kleinmichel form) are mathematically equivalent to the standard updates and therefore preserve the curvature conditions, positive definiteness, and descent property required for the sufficient decrease condition. This equivalence follows directly from the derivations in Section 3, and the added lemma will make the connection to the proximal-operator analysis explicit without altering the existing proof structure. revision: yes

  2. Referee: [KL convergence theorem] The claim of full-sequence convergence with rates under the KL property (abstract) depends on the subproblem solver and the properties of the scaled proximal operator; the interaction between the limited-memory update and the KL inequality needs a self-contained argument showing that the sequence remains in the region where the KL exponent applies, without additional hidden restrictions on the prox-boundedness constant.

    Authors: We thank the referee for this observation. In the revised version, we will expand the KL convergence section with a self-contained argument that explicitly shows the sequence remains in the relevant neighborhood for the KL inequality. This will use the adaptive regularization and the prox-boundedness assumption to bound the iterates, confirming that the KL exponent applies uniformly without imposing any new or hidden restrictions on the prox-boundedness constant. The argument will also tie the subproblem solver properties back to the scaled proximal operator to support the full-sequence convergence and rates as stated. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation chain consists of a standard proximal quasi-Newton framework globalized via adaptive regularization to enforce sufficient decrease, followed by a separate KL-property argument for full-sequence convergence and rates. These steps rely on established proximal-operator properties, lower semicontinuity, and prox-boundedness assumptions that are stated independently of the algorithm's limited-memory representation or Kleinmichel formula. The compact representations are introduced solely for subproblem efficiency and do not enter the convergence proofs. No equation or claim reduces to a fitted parameter, self-defined quantity, or load-bearing self-citation whose validity depends on the present paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on standard nonsmooth optimization assumptions and the Kurdyka-Lojasiewicz property; no explicit free parameters or invented entities are identifiable.

axioms (2)
  • domain assumption The nonsmooth function is proper, lower semicontinuous, and prox-bounded
    Invoked to ensure the proximal operator is well-defined and the method is applicable.
  • domain assumption Kurdyka-Lojasiewicz property
    Used to obtain convergence rates for the entire sequence.
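For reference, the standard form of the KL inequality behind the second axiom (as in Attouch et al., reference [3]): a function F satisfies the KL property at a point x̄ if there exist η > 0, a neighborhood U of x̄, and a concave desingularizing function φ ∈ C¹(0, η) with φ(0) = 0 and φ′ > 0 such that

```latex
\varphi'\bigl(F(x) - F(\bar{x})\bigr)\,
\operatorname{dist}\bigl(0, \partial F(x)\bigr) \ge 1
\quad \text{for all } x \in U \text{ with }
F(\bar{x}) < F(x) < F(\bar{x}) + \eta .
```

In the standard taxonomy, with φ(t) = c·t^{1−θ} the KL exponent θ ∈ [0, 1) governs the rates: finite-length behavior for θ = 0, linear rates for θ ∈ (0, 1/2], and sublinear rates for θ ∈ (1/2, 1).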

pith-pipeline@v0.9.0 · 5450 in / 1328 out tokens · 38575 ms · 2026-05-13T01:25:17.566912+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages

  1. F. J. Aragón Artacho, R. M. T. Fleming, and P. T. Vuong. Accelerating the DC algorithm for smooth functions. Mathematical Programming, 169(1):95–118, 2018. doi:10.1007/s10107-017-1180-1
  2. A. Y. Aravkin, R. Baraldi, and D. Orban. A proximal quasi-Newton trust-region method for nonsmooth regularized optimization. SIAM Journal on Optimization, 32(2):900–929, 2022. doi:10.1137/21M1409536
  3. H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35:438–457, 2010. doi:10.1287/moor.1100.0449
  4. S. Becker and J. Fadili. A quasi-Newton proximal splitting method. In Advances in Neural Information Processing Systems, volume 25, 2012
  5. S. Becker, J. Fadili, and P. Ochs. On quasi-Newton forward-backward splitting: Proximal calculus and convergence. SIAM Journal on Optimization, 29(4):2445–2481, 2019. doi:10.1137/18M1167152
  6. W. Bian and X. Chen. Linearly constrained non-Lipschitz optimization for image restoration. SIAM Journal on Imaging Sciences, 8(4):2294–2322, 2015. doi:10.1137/140985639
  7. E. G. Birgin, J. M. Martínez, and M. Raydan. Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization, 10(4):1196–1211, 2000. doi:10.1137/S1052623497330963
  8. E. G. Birgin, J. M. Martínez, and M. Raydan. Spectral projected gradient methods: Review and perspectives. Journal of Statistical Software, 60(3):1–21, 2014. doi:10.18637/jss.v060.i03
  9. J. Bolte, A. Daniilidis, A. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18:556–572, 2007. doi:10.1137/060670080
  10. J. Bolte, T. Nguyen, J. Peypouquet, and B. Suter. From error bounds to the complexity of first-order descent methods for convex functions. Mathematical Programming, 165, 2016. doi:10.1007/s10107-016-1091-6
  11. R. H. Byrd, J. Nocedal, and R. B. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63(1):129–156, 1994. doi:10.1007/BF01582063
  12. A. De Marchi. Proximal gradient methods beyond monotony. Journal of Nonsmooth Analysis and Optimization, 4, 2023. doi:10.46298/jnsao-2023-10290
  13. A. De Marchi and A. Themelis. Proximal gradient algorithms under local Lipschitz gradient continuity. Journal of Optimization Theory and Applications, 194(3):771–794, 2022. doi:10.1007/s10957-022-02048-5
  14. Y. Diouane, M. L. Habiboullah, and D. Orban. A proximal modified quasi-Newton method for nonsmooth regularized optimization. SIAM Journal on Optimization, 36(2):534–563, 2026. doi:10.1137/24M169761X
  15. M. Gollier, M. L. Habiboullah, G. Leconte, R. Baraldi, A. De Marchi, D. Orban, and Y. Diouane. RegularizedOptimization.jl: A Julia framework for regularized and nonsmooth optimization. Journal of Open Source Software, 11(118):9344, 2026. doi:10.21105/joss.09344
  16. L. Grippo, F. Lampariello, and S. Lucidi. A nonmonotone line search technique for Newton's method. SIAM Journal on Numerical Analysis, 23(4):707–716, 1986. doi:10.1137/0723046
  17. X. Jia, C. Kanzow, and P. Mehlitz. Convergence analysis of the proximal gradient method in the presence of the Kurdyka-Łojasiewicz property without global Lipschitz assumptions. SIAM Journal on Optimization, 33(4):3038–3056, 2023. doi:10.1137/23M1548293
  18. Q. Jin. Lectures on Nonsmooth Optimization. Springer, 2025. doi:10.1007/978-3-031-91417-1
  19. C. Kanzow and T. Lechner. Efficient regularized proximal quasi-Newton methods for large-scale nonconvex composite optimization problems. Pacific Journal of Optimization, 20:537–568, 2024. doi:10.61208/pjo-2023-036
  20. C. Kanzow and L. Lehmann. Convergence of nonmonotone proximal gradient methods under the Kurdyka-Łojasiewicz property without a global Lipschitz assumption. Journal of Optimization Theory and Applications, 207(1):4, 2025. doi:10.1007/s10957-025-02762-w
  21. C. Kanzow and P. Mehlitz. Convergence properties of monotone and nonmonotone proximal gradient methods revisited. Journal of Optimization Theory and Applications, 195:1–23, 2022. doi:10.1007/s10957-022-02101-3
  22. C. Kanzow and D. Steck. Regularization of limited memory quasi-Newton methods for large-scale nonconvex minimization. Mathematical Programming Computation, 15(3):417–444, 2023. doi:10.1007/s12532-023-00238-4
  23. H. Kleinmichel. Quasi-Newton-Verfahren vom Rang-Eins-Typ zur Lösung unrestringierter Minimierungsprobleme, Teil 1. Numerische Mathematik, 38(2):219–228, 1981. doi:10.1007/BF01397091
  24. H. Kleinmichel. Quasi-Newton-Verfahren vom Rang-Eins-Typ zur Lösung unrestringierter Minimierungsprobleme, Teil 2. Numerische Mathematik, 38(2):229–244, 1981. doi:10.1007/BF01397092
  25. K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8(54):1519–1555, 2007
  26. T. Lechner. Proximal Methods for Nonconvex Composite Optimization Problems. PhD Thesis, Institute of Mathematics, University of Würzburg, 2022. doi:10.25972/OPUS-28907
  27. J. Lee, Y. Sun, and M. Saunders. Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24:1420–1443, 2014. doi:10.1137/130921428
  28. H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems 28, volume 28, 2015
  29. D. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503–528, 1989. doi:10.1007/BF01589116
  30. R. Liu, S. Pan, Y. Wu, and X. Yang. An inexact regularized proximal Newton method for nonconvex and nonsmooth optimization. Computational Optimization and Applications, 88(2):603–641, 2024. doi:10.1007/s10589-024-00560-0
  31. B. Mordukhovich. Variational Analysis and Applications, volume 30. Springer, 2018. doi:10.1007/978-3-319-92775-6
  32. J. Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980. doi:10.1090/S0025-5718-1980-0572855-7
  33. R. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer-Verlag, Heidelberg, Berlin, New York, 1998
  34. L. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60:259–268, 1992. doi:10.1016/0167-2789(92)90242-F
  35. K. Scheinberg and X. Tang. Practical inexact proximal quasi-Newton method with global complexity analysis. Mathematical Programming, 160(1–2):495–529, 2016. doi:10.1007/s10107-016-0997-3
  36. P. Spellucci. A modified rank one update which converges Q-superlinearly. Computational Optimization and Applications, 19:273–296, 2001. doi:10.1023/A:1011259905470
  37. L. Stella, A. Themelis, P. Sopasakis, and P. Patrinos. A simple and efficient algorithm for nonlinear model predictive control. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 1939–1944, 2017. doi:10.1109/CDC.2017.8263933
  38. A. Themelis, L. Stella, and P. Patrinos. Forward-backward envelope for the sum of two nonconvex functions: Further properties and nonmonotone linesearch algorithms. SIAM Journal on Optimization, 28(3):2274–2303, 2018. doi:10.1137/16M1080240
  39. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996. doi:10.1111/j.2517-6161.1996.tb02080.x
  40. K. Ueda and N. Yamashita. A regularized Newton method without line search for unconstrained optimization. Computational Optimization and Applications, 59(1):321–351, 2014. doi:10.1007/s10589-014-9656-x
  41. S. vom Dahl and C. Kanzow. An inexact regularized proximal Newton method without line search. Computational Optimization and Applications, 89(3):585–624, 2024. doi:10.1007/s10589-024-00600-9
  42. S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009. doi:10.1109/TSP.2009.2016892
  43. H. Zhang and W. W. Hager. A nonmonotone line search technique and its application to unconstrained optimization. SIAM Journal on Optimization, 14(4):1043–1056, 2004. doi:10.1137/S1052623403428208