pith. machine review for the scientific record.

arxiv: 2605.07006 · v1 · submitted 2026-05-07 · 🧮 math.OC

Recognition: 2 theorem links · Lean theorem

Lectures on optimization

Sinho Chewi

Pith reviewed 2026-05-11 00:49 UTC · model grok-4.3

classification 🧮 math.OC
keywords convex optimization · first-order methods · optimization theory · gradient descent · convex analysis · lecture notes

The pith

These lecture notes present the theory of convex optimization emphasizing first-order methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper is a set of lecture notes that systematically cover the theory of convex optimization with a strong focus on first-order methods. A sympathetic reader would care because these methods provide efficient ways to solve optimization problems that arise in machine learning and other fields, without requiring expensive second-order computations such as forming Hessians. The notes aim to build both intuition and rigorous understanding through detailed explanations and proofs. If the notes succeed, readers will be able to analyze and apply first-order algorithms confidently.

Core claim

These lecture notes cover the theory of convex optimization, with a particular emphasis on first-order methods. The goal is to provide a clear and accessible presentation of the key theoretical results and algorithmic techniques in this domain.

What carries the argument

First-order methods, which iteratively update solutions using only gradient information to minimize convex objective functions.
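For orientation, a minimal sketch of the template these methods follow: plain gradient descent with a fixed step size. The quadratic objective, step size, and iteration count below are illustrative choices, not taken from the notes.

```python
import numpy as np

def gradient_descent(grad, x0, step, iters):
    """Plain gradient descent: x_{k+1} = x_k - step * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Illustrative smooth convex objective f(x) = (1/2) x^T A x, minimized at 0.
A = np.array([[2.0, 0.0], [0.0, 10.0]])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

# For an L-smooth objective, step sizes up to 1/L are safe; here L = 10.
x_final = gradient_descent(grad_f, x0=[1.0, 1.0], step=0.1, iters=200)
print(f(x_final))  # close to the optimal value 0
```

Only gradient evaluations appear in the loop, which is the point: cost per iteration scales with the dimension rather than with forming second-order information.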

If this is right

  • Readers acquire the ability to prove convergence rates for standard first-order algorithms on convex problems.
  • The notes demonstrate the scalability advantages of first-order methods for high-dimensional optimization tasks.
  • Key concepts in convex analysis are connected directly to practical algorithm design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The notes could be extended to include numerical examples or code implementations to bridge theory and practice.
  • Similar lecture notes might be developed for non-convex optimization using related techniques.
  • Testing the notes would involve checking if readers can independently solve standard convex optimization problems after study.

Load-bearing premise

Readers possess sufficient background in mathematics including linear algebra and real analysis to follow the theoretical developments.

What would settle it

A reader with the required background is unable to understand or replicate the proof of convergence for the gradient descent method after studying the notes.
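For concreteness, the result in question is the textbook rate for gradient descent on an L-smooth convex function; a compressed statement of the usual argument (standard material, not quoted from the notes) reads:

```latex
% Gradient descent with step size h = 1/L on an L-smooth convex f.
% One step satisfies the descent lemma:
\[
  x_{k+1} = x_k - \tfrac{1}{L}\nabla f(x_k)
  \quad\Longrightarrow\quad
  f(x_{k+1}) \;\le\; f(x_k) - \tfrac{1}{2L}\,\lVert \nabla f(x_k)\rVert^2 ,
\]
% and combining this with convexity and a telescoping sum gives the rate
\[
  f(x_T) - f_\star \;\le\; \frac{L\,\lVert x_0 - x_\star\rVert^2}{2T}.
\]
```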

Figures

Figures reproduced from arXiv: 2605.07006 by Sinho Chewi.

Figure 1
Figure 1. We compare the gradient flow with AGF for f(x) := (1/2)(α x[1]² + x[2]²) and γ = 2√α, where α = 1/16 and x₀ = (1, 1). Note that the loss is not monotonic over time, hence the need for a different Lyapunov function ℒ_t. Recall that under convexity and α-convexity, the objective gap f(x_t) − f⋆ for GF converges at the rates O(1/t) and O(exp(−2αt)) respectively. On the other hand, for AGF, the converge…
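A rough way to reproduce the comparison in this caption is to integrate both flows with a small Euler step. Assumptions: AGF is read here as the damped second-order flow ẍ + γẋ + ∇f(x) = 0 with γ = 2√α, which the caption's parameters suggest but the excerpt does not spell out; the horizon and step size below are arbitrary.

```python
import numpy as np

alpha = 1.0 / 16.0
f = lambda x: 0.5 * (alpha * x[0] ** 2 + x[1] ** 2)
grad = lambda x: np.array([alpha * x[0], x[1]])

gamma = 2.0 * np.sqrt(alpha)      # damping from the caption
dt, T = 1e-3, 40.0                # illustrative discretization

x_gf = np.array([1.0, 1.0])                        # gradient flow: x' = -grad f(x)
x_agf, v_agf = np.array([1.0, 1.0]), np.zeros(2)   # assumed AGF: x'' + gamma x' + grad f(x) = 0

for _ in range(int(T / dt)):
    x_gf = x_gf - dt * grad(x_gf)
    v_agf = v_agf - dt * (gamma * v_agf + grad(x_agf))
    x_agf = x_agf + dt * v_agf

print("objective, GF :", f(x_gf))
print("objective, AGF:", f(x_agf))   # smaller here, though not monotone along the trajectory
```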
Figure 2
Figure 2. We plot the iterates of PSD, with and without iterate averaging, for the function f(x) := |x[1]| + 2|x[2]| with step size h = 0.15. The analysis above shows that when the projection operator is cheap to compute, optimization under constraints is straightforward provided that we interleave the gradient steps with projection steps. We next tackle a more general setting in which we separate out the cons…
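A minimal sketch of what this caption describes, reading PSD as (projected) subgradient descent; since the excerpt shows no constraint set, the projection is taken to be the identity, and the starting point and iteration count are illustrative.

```python
import numpy as np

f = lambda x: abs(x[0]) + 2.0 * abs(x[1])
subgrad = lambda x: np.array([np.sign(x[0]), 2.0 * np.sign(x[1])])   # one subgradient of f

h = 0.15                        # step size from the caption
x = np.array([1.0, 1.0])        # illustrative starting point
iterates = [x.copy()]
for _ in range(200):
    x = x - h * subgrad(x)      # projection step omitted: no constraint in the excerpt
    iterates.append(x.copy())

print("f at last iterate:", f(iterates[-1]))                    # oscillates at a scale set by h
print("f at averaged iterate:", f(np.mean(iterates, axis=0)))   # averaging damps the oscillation
```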
Figure 3
Figure 3. The soft thresholding operator. Therefore, it suffices to solve the problem in dimension one. A direct computation (see Exercise 8.1) then yields prox_{λ|·|}(y) = (|y| − λ)₊ sgn(y) =: thresh_λ(y), where (·)₊ := max{0, ·} denotes the positive part. The operator thresh_λ, known as the soft thresholding operator, …
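The closed form in the caption is easy to state directly; a one-function sketch, applied coordinatewise with illustrative inputs:

```python
import numpy as np

def soft_threshold(y, lam):
    """thresh_lam(y) = (|y| - lam)_+ * sign(y), applied coordinatewise."""
    return np.maximum(np.abs(y) - lam, 0.0) * np.sign(y)

y = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
print(soft_threshold(y, lam=1.0))   # entries with |y| <= 1 map to 0; the rest shrink toward 0 by 1
```

Coordinates with magnitude below λ are zeroed outright, which is why this proximal step produces sparse iterates in ℓ1-regularized problems.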
read the original abstract

These lecture notes cover the theory of convex optimization, with a particular emphasis on first-order methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript consists of lecture notes on the theory of convex optimization, with a particular emphasis on first-order methods. It is purely expository and presents established material without advancing new theorems, derivations, or empirical results.

Significance. The covered topics are standard and well-established in the field of optimization. While clear lecture notes can aid teaching and reference, the absence of novel contributions, machine-checked proofs, or falsifiable predictions means the work has limited significance for advancing research in a journal setting.

minor comments (1)
  1. The abstract is very brief and does not outline the specific topics, prerequisites, or structure of the notes (e.g., which first-order methods are covered in detail).

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for reviewing our manuscript on lecture notes for convex optimization. We address the points raised regarding its expository nature below.

read point-by-point responses
  1. Referee: The manuscript consists of lecture notes on the theory of convex optimization, with a particular emphasis on first-order methods. It is purely expository and presents established material without advancing new theorems, derivations, or empirical results.

    Authors: We agree that the work is expository and presents established material. The manuscript is intended as lecture notes to provide a structured, accessible presentation of convex optimization theory with emphasis on first-order methods, rather than to derive new results. revision: no

  2. Referee: The covered topics are standard and well-established in the field of optimization. While clear lecture notes can aid teaching and reference, the absence of novel contributions, machine-checked proofs, or falsifiable predictions means the work has limited significance for advancing research in a journal setting.

    Authors: We acknowledge that the topics covered are standard in the optimization literature. The notes aim to support teaching and self-study by offering clear explanations, but we do not claim research-level novelty or new empirical findings. revision: no

standing simulated objections not resolved
  • The manuscript contains no new theorems, derivations, or empirical results, as this is inherent to its purpose as expository lecture notes.

Circularity Check

0 steps flagged

No significant circularity in expository lecture notes

full rationale

The manuscript consists of lecture notes on established convex optimization theory with emphasis on first-order methods. It is purely expository and descriptive, presenting standard material without novel derivations, empirical fits, parameter estimations, or self-referential claims that could reduce to inputs by construction. No load-bearing equations, proofs, or predictions exist that require checking for circularity, so the score is zero with no steps identified.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No new content; standard convex optimization theory is assumed from prior literature with no free parameters, axioms, or invented entities introduced.

pith-pipeline@v0.9.0 · 5275 in / 761 out tokens · 38437 ms · 2026-05-11T00:49:26.677183+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

25 extracted references · 3 canonical work pages
