pith. machine review for the scientific record.

arxiv: 2605.11929 · v1 · submitted 2026-05-12 · 🧮 math.OC

Recognition: no theorem link

Convergence of zeroth-order proximal point algorithms in the high-temperature regime

Cesare Molinari, Emanuele Naldi, Hippolyte Labarrière, Silvia Villa

Pith reviewed 2026-05-13 05:14 UTC · model grok-4.3

classification 🧮 math.OC
keywords zeroth-order optimization · proximal point algorithm · non-convex optimization · smoothed objective · Monte Carlo sampling · convergence analysis

The pith

ZOPPA converges at any fixed positive temperature by acting as the exact proximal point method on a smoothed auxiliary objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that the Zeroth-Order Proximal Point Algorithm (ZOPPA) converges under minimal assumptions on non-convex black-box objectives when the temperature parameter is held fixed at any positive value. It achieves this by reinterpreting ZOPPA as the exact proximal point algorithm applied to a smoothed auxiliary version of the objective, rather than as an inexact method on the original function. Explicit bounds are derived that relate minimizers and values of the smoothed problem back to the original objective. The same convergence holds when the proximal operator is replaced by its Monte Carlo sample estimate at fixed temperature.
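
For orientation, the operator at stake can be written out explicitly. The notation below is pieced together from the paper's figure captions, so the exact constants may differ from the paper's definition:

    % ZOPO at temperature δ > 0: a Gibbs average around the query point x.
    \[
      \mathrm{zprox}^{\delta}_{\lambda,f}(x) = \mathbb{E}_{y \sim \mu_x}[y],
      \qquad
      \mu_x(y) \propto \exp\!\Big(-\frac{f(y)}{\delta}\Big)
                       \exp\!\Big(-\frac{\|x-y\|^2}{2\lambda\delta}\Big).
    \]
    % As δ → 0, Laplace's method concentrates μ_x on the minimizer of
    % f(y) + ‖x−y‖²/(2λ), recovering the classical prox_{λf}(x);
    % the paper instead keeps δ fixed and strictly positive.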

Core claim

ZOPPA is equivalent to the exact proximal point algorithm on an auxiliary objective obtained by smoothing the original function at fixed positive temperature. Under standard assumptions that make the proximal operator well-defined, the sequence converges to a minimizer of this smoothed objective. Explicit guarantees connect the smoothed problem to the original objective, and the same convergence holds for the sampled S-ZOPPA estimator at fixed temperature.
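
A standard computation, written with the Z_{λ,δ} notation visible in the paper's excerpts, makes the reinterpretation concrete. The envelope below is a plausible reconstruction, not necessarily the paper's exact auxiliary objective, and the paper's claim (an exact proximal step, not merely a gradient step) is stronger than the identity shown here:

    % Gibbs partition function and log-smoothed envelope (assumed form).
    \[
      Z_{\lambda,\delta}(x) = \int \exp\!\Big(-\frac{f(y)}{\delta}
          - \frac{\|x-y\|^2}{2\lambda\delta}\Big)\, dy,
      \qquad
      f^{\lambda,\delta}(x) = -\delta \log Z_{\lambda,\delta}(x).
    \]
    % Differentiating under the integral sign gives
    \[
      \nabla f^{\lambda,\delta}(x)
        = \frac{x - \mathrm{zprox}^{\delta}_{\lambda,f}(x)}{\lambda},
      \qquad\text{i.e.}\qquad
      \mathrm{zprox}^{\delta}_{\lambda,f}(x)
        = x - \lambda\, \nabla f^{\lambda,\delta}(x),
    \]
    % so ZOPPA fixed points coincide with stationary points of f^{λ,δ}:
    % iterating the operator on f acts on the smoothed objective exactly.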

What carries the argument

The auxiliary smoothed objective at fixed positive temperature, which turns ZOPPA into an exact proximal point method.

If this is right

  • Convergence holds without driving temperature to zero.
  • Explicit quantitative links exist between smoothed and original objectives.
  • Sampled Monte Carlo versions converge at fixed temperature (see the sketch after this list).
  • Analysis applies under only the minimal conditions needed for proximal operators to exist.
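
A minimal sketch of the sampled method, assuming the SNIS form described in the Figure 2 caption further down (Gaussian proposal N(x, λδI), weights exp(−f(y)/δ)); the function names, defaults, and iteration count here are ours, not the authors':

    import numpy as np

    def s_zopo(f, x, lam=1.0, delta=1.0, n_samples=1_000, rng=None):
        """Monte Carlo estimate of the zeroth-order proximal operator.

        Self-normalized importance sampling: proposal N(x, lam*delta*I),
        weights exp(-f(y)/delta), as in the paper's Figure 2 caption.
        """
        rng = np.random.default_rng() if rng is None else rng
        x = np.atleast_1d(np.asarray(x, dtype=float))
        # Proposals y_i ~ N(x, lam*delta*I); the Gaussian factor of mu_x
        # cancels exactly, leaving only the Gibbs weights below.
        y = x + np.sqrt(lam * delta) * rng.standard_normal((n_samples, x.size))
        log_w = -np.array([f(yi) for yi in y]) / delta
        w = np.exp(log_w - log_w.max())  # shift for numerical stability
        w /= w.sum()
        return w @ y  # self-normalized estimate of E_{mu_x}[y]

    def s_zoppa(f, x0, n_iters=50, **kwargs):
        """Iterate the sampled operator: x_{k+1} = S-ZOPO(x_k)."""
        x = np.atleast_1d(np.asarray(x0, dtype=float))
        for _ in range(n_iters):
            x = s_zopo(f, x, **kwargs)
        return x

    # f(x) = |x| from x = 2, lam = delta = 1, as in the paper's figures.
    print(s_zoppa(lambda z: np.abs(z).sum(), 2.0, lam=1.0, delta=1.0))

Because the proposal matches the Gaussian factor of µ_x exactly, the importance weights reduce to exp(−f(y)/δ), which is also why small δ is so punishing for this estimator.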

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Fixed temperature removes the variance explosion that occurs when sampling at vanishing temperature (a small diagnostic follows this list).
  • The smoothing view may let practitioners choose temperature to balance bias and sampling cost independently of convergence.
  • Similar exact-on-smoothed interpretations could be tested on other zeroth-order proximal variants.
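
To make the first point concrete, a small diagnostic using Kish's effective sample size on the same weights (our construction; the paper's Figure 3 reports mean ESS for this setup):

    import numpy as np

    def ess_fraction(f, x, lam=1.0, delta=1.0, n=10_000, rng=None):
        """Kish ESS of the S-ZOPO weights, as a fraction of n.

        ESS = (sum w)^2 / sum w^2; near 1 means most samples contribute,
        near 1/n means a single sample dominates the estimate.
        """
        rng = np.random.default_rng(0) if rng is None else rng
        x = np.atleast_1d(np.asarray(x, dtype=float))
        y = x + np.sqrt(lam * delta) * rng.standard_normal((n, x.size))
        log_w = -np.array([f(yi) for yi in y]) / delta
        w = np.exp(log_w - log_w.max())
        return (w.sum() ** 2) / ((w ** 2).sum() * n)

    # f(x) = |x| at x = 2: the useful fraction of samples collapses as the
    # temperature shrinks, which is the variance explosion in question.
    for delta in (1.0, 0.1, 0.01):
        print(delta, ess_fraction(lambda z: np.abs(z).sum(), 2.0, delta=delta))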

Load-bearing premise

The objective allows the proximal operator of the smoothed auxiliary function to be well-defined at any fixed positive temperature.
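
For what "well-defined" plausibly requires, a sufficient condition in our reading (the paper's Assumption 2.1 is only partially visible in the extracted text): if f is lower semicontinuous, bounded below by −C, and finite on a set of positive measure, then at every fixed λ, δ > 0

    \[
      0 < Z_{\lambda,\delta}(x)
        \le e^{C/\delta} \int_{\mathbb{R}^d}
            e^{-\|x-y\|^2/(2\lambda\delta)}\, dy < \infty,
    \]
    % and the same Gaussian domination bounds the first moment of μ_x, so the
    % Gibbs average zprox^δ_{λ,f}(x) = E_{μ_x}[y] exists; coercivity would then
    % tie minimizers of the smoothed problem back to those of f.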

What would settle it

A non-convex function satisfying standard lower-semicontinuity and coercivity assumptions for which the ZOPPA iterates diverge, or fail to approach a minimizer of the corresponding smoothed objective, at some fixed positive temperature.

Figures

Figures reproduced from arXiv: 2605.11929 by Cesare Molinari, Emanuele Naldi, Hippolyte Labarrière, Silvia Villa.

Figure 1
Figure 1: Sampling difficulty as δ decreases for f : x ↦ |x|, with x = 2 and λ = 1. Left and center: comparison of the objective weight exp(−f/δ) and the Gaussian proposal weight for large and small δ. Right: classical prox, exact ZOPO, and S-ZOPO approximation as functions of δ.
Figure 2
Figure 2: Comparison between fixed λ and decreasing λ with δ = 1 fixed. The surrounding text notes that S-ZOPO can be seen as a Self-Normalized Importance Sampling (SNIS) estimator [23] of E_{y∼µ_x}[y], with µ_x(y) ∝ exp(−f(y)/δ) exp(−‖x − y‖²/(2λδ)) and proposal density N(x, λδI).
Figure 3
Figure 3: Effect of the temperature δ on sampling for f : x ↦ |x|, with x = 2 and λ = 1. Left: ZOPO, S-ZOPO, classical prox, and mean ESS. Center and right: empirical bias and variance of S-ZOPO. The surrounding text notes that forcing δ → 0 drives S-ZOPO into the random-search regime, since a small δ makes the escaping probability (13) vanish.
Figure 4
Figure 4: Example of the convexification effect in λ. The figure also shows that moving towards minimizers of f_{λ,δ} drives the iterates into low-energy regions of the original f. The convexification effect sets in quickly in practice for δ not too small, while smaller δ requires large λ.
Figure 5
Figure 5: Theoretical asymptotic variance and empirical variance of …
Figure 2
Figure 2: The values for fixed λ are 0.01, 0.1, 1, and 10, while the schedule for the decreasing λ runs on a geometric scale from 10 to 0.01 with 10 intermediate values.
read the original abstract

Efficient methods for non-convex black-box optimization largely rely on sampling. In this context, the Zeroth-Order Proximal Operator (ZOPO) and the corresponding Zeroth-Order Proximal Point Algorithm (ZOPPA) have attracted significant interest, as they combine the advantage of requiring only objective evaluations with the powerful theoretical framework of proximal point algorithms. ZOPO depends on a temperature parameter which, when going to zero, reduces ZOPO to the exact proximal operator. By exploiting this property, the vanishing-temperature regime has been leveraged in several works to obtain theoretical guarantees via inexact proximal methods. However, this regime is computationally unsustainable when sampling is used to estimate ZOPO, since the corresponding Monte Carlo estimators suffer from severe variance issues. We therefore propose a comprehensive analysis of ZOPO for any fixed positive temperature, and prove convergence of ZOPPA under minimal assumptions on the objective function. We do so by demonstrating that ZOPPA can be interpreted as an exact proximal point method applied to an auxiliary smoothed objective, rather than an inexact method on the original function. Importantly, we further derive explicit guarantees connecting this smoothed problem back to the original objective and establish convergence results for the sampled method (S-ZOPPA) at a fixed temperature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper analyzes zeroth-order proximal point algorithms for non-convex black-box optimization. It focuses on the high-temperature (fixed positive temperature) regime and proves convergence of ZOPPA by reinterpreting the algorithm as an exact proximal-point iteration on an auxiliary smoothed objective function rather than an inexact iteration on the original objective. Explicit error bounds or guarantees are derived that connect the smoothed problem back to the original objective, and convergence is established for the sampled variant S-ZOPPA under minimal assumptions (lower semicontinuity and coercivity sufficient for the proximal operator to be well-defined).

Significance. If the derivations hold, the work supplies a clean theoretical justification for using fixed-temperature ZOPPA without invoking vanishing-temperature limits, which are impractical under sampling. The exact-proximal reinterpretation on the auxiliary smoothed objective, together with the explicit links to the original function, strengthens the applicability of proximal-point theory to zeroth-order methods and directly addresses variance issues in Monte Carlo estimation at low temperatures.

minor comments (3)
  1. The abstract states that 'explicit guarantees connecting this smoothed problem back to the original objective' are derived, but the precise form of these guarantees (e.g., whether they are in terms of function values, distances, or expectations) is not previewed; adding one sentence would improve readability.
  2. Notation for the temperature parameter and the smoothing operator should be introduced once in §2 and used consistently thereafter; occasional redefinition of the auxiliary objective risks minor confusion for readers.
  3. The manuscript would benefit from a short remark in the introduction clarifying how the minimal assumptions differ from (or coincide with) those used in prior vanishing-temperature analyses.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work on the convergence of zeroth-order proximal point algorithms at fixed positive temperature, their assessment of its significance, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation reinterprets ZOPPA at fixed positive temperature as an exact proximal-point iteration on an auxiliary smoothed objective, with explicit links back to the original function under standard lower-semicontinuity/coercivity assumptions. This step is independent of the target convergence result and does not reduce by construction to a fitted parameter, self-definition, or self-citation chain. The argument is self-contained against external proximal-analysis benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard domain assumptions from proximal analysis that are not derived in the paper; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption The objective function satisfies the minimal assumptions required for the proximal operator to be well-defined at fixed positive temperature.
    Invoked when stating convergence under minimal assumptions; typical in the field but not proved here.

pith-pipeline@v0.9.0 · 5530 in / 1264 out tokens · 59305 ms · 2026-05-13T05:14:59.864051+00:00 · methodology


Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages

  1. [1] S. Agapiou, O. Papaspiliopoulos, D. Sanz-Alonso, and A. M. Stuart. Importance sampling: Intrinsic dimension and computational cost. Statistical Science, pages 405–431, 2017.
  2. [2] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013.
  3. [3] D. Bakry, I. Gentil, and M. Ledoux. Analysis and Geometry of Markov Diffusion Operators, volume 348. Springer, 2014.
  4. [4] A. Beck. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017.
  5. [5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2:183–202, 2009.
  6. [6] T. Bengtsson, P. Bickel, and B. Li. Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. In Probability and Statistics: Essays in Honor of David A. Freedman, volume 2, pages 316–334. Institute of Mathematical Statistics, 2008.
  7. [7] J. Bolte, A. Daniilidis, and A. S. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
  8. [8] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
  9. [9] K. Bredies, E. Chenchene, and E. Naldi. Graph and distributed extensions of the Douglas–Rachford method. SIAM Journal on Optimization, 34(2):1569–1594, 2024.
  10. [10] P. Cattiaux and A. Guillin. Functional inequalities for perturbed measures with applications to log-concave measures and to some Bayesian problems. Bernoulli, 28(4):2294–2321, 2022.
  11. [11] A. Chambolle and C. Dossal. On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. Journal of Optimization Theory and Applications, 166:968–982, 2015.
  12. [12] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vision, 40(1):120–145, 2011.
  13. [13] G. H.-G. Chen and R. T. Rockafellar. Convergence rates in forward–backward splitting. SIAM Journal on Optimization, 7(2):421–444, 1997.
  14. [14] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
  15. [15] S. Chewi. Log-Concave Sampling. Forthcoming, 2026. Available online at https://chewisinho.github.io/.
  16. [16] K. Choromanski, M. Rowland, V. Sindhwani, R. Turner, and A. Weller. Structured evolution with compact architectures for scalable policy optimization. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 970–978, 2018.
  17. [17] P. L. Combettes and J. I. Madariaga. Asymptotic analysis of an abstract stochastic scheme for solving monotone inclusions, 2026. arXiv:2512.03023.
  18. [18] P. L. Combettes and J.-C. Pesquet. Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM Journal on Optimization, 25(2):1221–1248, 2015.
  19. [19] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.
  20. [20] A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-Free Optimization, volume 8 of MPS-SIAM Series on Optimization. SIAM, Philadelphia, PA, 2009.
  21. [21] M. G. Crandall and P.-L. Lions. Two approximations of solutions of Hamilton-Jacobi equations. Mathematics of Computation, 43(167):1–19, 1984.
  22. [22] J. Darbon, G. P. Langlois, and T. Meng. Connecting Hamilton-Jacobi partial differential equations with maximum a posteriori and posterior mean estimators for some non-convex priors. In Handbook of Mathematical Models and Algorithms in Computer Vision and Imaging: Mathematical Imaging and Vision, pages 1–25, 2021.
  23. [23] G. Deligiannidis, P. E. Jacob, E. M. Khribch, and G. Wang. On importance sampling and independent Metropolis-Hastings with an unbounded weight function, 2024. arXiv:2411.09514.
  24. [24] N. Di, E. C. Chi, and S. W. Fung. Operator splitting with Hamilton-Jacobi-based proximals.
  25. [25] J. Dunn. Convexity, monotonicity, and gradient processes in Hilbert space. Journal of Mathematical Analysis and Applications, 53(1):145–158, 1976.
  26. [26] L. C. Evans. Partial Differential Equations, volume 19 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2nd edition, 2010.
  27. [27] A. Flaxman, A. T. Kalai, and B. McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In SODA '05: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394, 2005.
  28. [28] G. B. Folland. Real Analysis: Modern Techniques and Their Applications. Pure and Applied Mathematics. Wiley-Interscience, New York, 2nd edition, 1999.
  29. [29] M. Fornasier, H. Huang, J. Klemenc, and G. Malaspina. From consensus-based optimization to evolution strategies: Proof of global convergence, 2026. arXiv:2602.11677.
  30. [30] R. Gribonval. Should penalized least squares regression be interpreted as maximum a posteriori estimation? IEEE Transactions on Signal Processing, 59(5):2405–2410, 2011.
  31. [31] R. Gribonval and P. Machart. Reconciling “priors” & “priors” without prejudice? In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
  32. [32] R. Gribonval and M. Nikolova. A characterization of proximity operators. Journal of Mathematical Imaging and Vision, 62(6):773–789, 2020.
  33. [33] T. C. Hesterberg. Advances in Importance Sampling. Stanford University, 1988.
  34. [34] R. Holley and D. Stroock. Logarithmic Sobolev inequalities and stochastic Ising models. Journal of Statistical Physics, 46(5-6):1159–1194, 1987.
  35. [35] L. Kish. Survey Sampling. John Wiley & Sons, New York, 1965.
  36. [36] A. Kong. A note on importance sampling using standardized weights. University of Chicago, Dept. of Statistics, Tech. Rep. 348:14, 1992.
  37. [37] J. Larson, M. Menickelly, and S. M. Wild. Derivative-free optimization methods. Acta Numerica, 28:287–404, 2019.
  38. [38] G. Lauga and S. Vaiter. Characterizations of inexact proximal operators, 2026. arXiv:2602.02022.
  39. [39] C. Lemieux. Monte Carlo and Quasi-Monte Carlo Sampling. Springer Series in Statistics. Springer, New York, NY, 2009.
  40. [40] P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
  41. [41] J. S. Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113–119, 1996.
  42. [42] S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. In Les Équations aux Dérivées Partielles, pages 87–89, Paris, 1963. Éditions du Centre National de la Recherche Scientifique.
  43. [43] X. Luo, X. Xu, and H. Rabitz. Asymptotic proximal point methods for global optimization. Journal of Scientific Computing, 105(3):87, 2025.
  44. [44] H. Mania, A. Guy, and B. Recht. Simple random search of static linear policies is competitive for reinforcement learning. In NIPS'18. Curran Associates Inc., 2018.
  45. [45] B. Martinet. Brève communication. Régularisation d'inéquations variationnelles par approximations successives. Revue française d'informatique et de recherche opérationnelle. Série rouge, 4(R3):154–158, 1970.
  46. [46] D. Morales, P. Pérez-Aros, and E. Vilches. Convergence rates for stochastic proximal and projection estimators, 2026. arXiv:2602.06750.
  47. [47] J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93:273–299, 1965.
  48. [48] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Found. Comput. Math., 17(2):527–566, 2017.
  49. [49] S. Osher, H. Heaton, and S. Wu Fung. A Hamilton–Jacobi-based proximal operator. Proceedings of the National Academy of Sciences, 120(14):e2220469120, 2023.
  50. [50] M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1674–1703. PMLR, 2017.
  51. [51] M. Reed and B. Simon. Fourier Analysis, Self-Adjointness, volume 2 of Methods of Modern Mathematical Physics. Academic Press, New York, 1975.
  52. [52] K. Riedl, T. Klock, C. Geldhauser, and M. Fornasier. How consensus-based optimization can be interpreted as a stochastic relaxation of gradient descent. In 2nd Differentiable Almost Everything Workshop at the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 2024.
  53. [53] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
  54. [54] R. Y. Rubinstein and D. P. Kroese. Simulation and the Monte Carlo Method. John Wiley & Sons, 2016.
  55. [55] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning, 2017. arXiv:1703.03864.
  56. [56] S. Salzo and S. Villa. Inexact and accelerated proximal point algorithms. Journal of Convex Analysis, 19(4):1167–1192, 2012.
  57. [57] A. Saumard and J. A. Wellner. Log-concavity and strong log-concavity: a review. Statistics Surveys, 8:45–114, 2014.
  58. [58] M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, volume 24, pages 1458–1466. Curran Associates, Inc., 2011.
  59. [59] J. C. Spall. Introduction to Stochastic Search and Optimization. John Wiley & Sons, Inc., USA, 1st edition, 2003.
  60. [60] R. J. Tibshirani, S. W. Fung, H. Heaton, and S. Osher. Laplace meets Moreau: Smooth approximation to infimal convolutions using Laplace's method. Journal of Machine Learning Research, 26(72):1–36, 2025.
  61. [61] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
  62. [62] P. Xu, J. Chen, D. Zou, and Q. Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  63. [63] X. Xu, Y. Sun, J. Liu, B. Wohlberg, and U. S. Kamilov. Provable convergence of Plug-and-Play priors with MMSE denoisers. IEEE Signal Processing Letters, 27:1280–1284, 2020.
  64. [64] M. Zhang, F. Han, Y. T. Chow, S. Osher, and H. Schaeffer. Inexact proximal point algorithms for zeroth-order global optimization, 2024. arXiv:2412.11485.
  65. [65] Y. Zhang, P. Li, J. Hong, J. Li, Y. Zhang, W. Zheng, P.-Y. Chen, J. D. Lee, W. Yin, M. Hong, Z. Wang, S. Liu, and T. Chen. Revisiting zeroth-order optimization for memory-efficient LLM fine-tuning: A benchmark. In Forty-first International Conference on Machine Learning, 2024.
  66. [66] A. Åkerman, E. Chenchene, P. Giselsson, and E. Naldi. Splitting the forward-backward algorithm: A full characterization, 2025. arXiv:2504.10999.

    Thus, fixed points are precisely stationary points off λ,δ (equivalently, stationary points ofZ λ,δ). With the following result we provide convergence of the iterates and the rate on the residuals. Theorem 10(Convergence of the iterates).Let f:R d →R satisfy Assumption 2.1. Then f λ,δ admits at least a minimizer, satisfies the KL property and the sequence...