How AI settled the complexity of the oldest SGD algorithm

Micha{\l} Derezi\'nski; Xiaoyu Dong

arxiv: 2606.29593 · v1 · pith:ADHTHG7Hnew · submitted 2026-06-28 · 💻 cs.LG · cs.AI· cs.NA· math.NA· math.OC· stat.ML

How AI settled the complexity of the oldest SGD algorithm

Micha{\l} Derezi\'nski , Xiaoyu Dong This is my paper

Pith reviewed 2026-06-30 07:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.NAmath.NAmath.OCstat.ML

keywords Kaczmarz algorithmstochastic gradient descentworst-case complexitylinear systemsconvergence analysisiterative methodsAI-assisted discovery

0 comments

The pith

AI models have discovered the worst-case complexity of the Kaczmarz algorithm, the earliest form of stochastic gradient descent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper recounts how artificial intelligence systems collaborated to resolve the long-standing question of the exact worst-case complexity of the Kaczmarz algorithm first proposed in 1937. This method is recognized as the original stochastic gradient descent procedure and forms a foundational component of the optimization techniques used to train contemporary large language models. A sympathetic reader would care because the result supplies the tightest known performance guarantees for this basic iterative solver of linear systems and illustrates how AI can contribute to settling open theoretical questions in optimization. The narrative frames the discovery as a joint effort among models that produced the complexity bound without prior human derivation.

Core claim

The paper states that AI models working together have determined the precise worst-case complexity of the Kaczmarz algorithm, an iterative procedure for solving linear equations that selects equations at random or in sequence, thereby establishing its fundamental convergence rate that had remained unresolved since its introduction.

What carries the argument

The Kaczmarz algorithm itself, an iterative row-selection method for linear systems that serves as the earliest instance of stochastic gradient descent.

If this is right

The result supplies tight convergence guarantees for the foundational SGD variant used in linear solvers.
It positions the Kaczmarz method as the historical root whose complexity now serves as a reference point for later SGD variants.
The discovery process demonstrates that AI collaboration can resolve open questions in the analysis of iterative algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same AI-assisted approach could be applied to derive tight bounds for other randomized iterative methods whose complexity remains open.
It raises the possibility that theoretical computer science may increasingly rely on automated assistance for identifying extremal examples and proving bounds.
Connections appear between this linear-algebra setting and the analysis of convergence rates in modern non-convex optimization.

Load-bearing premise

The assumption that the complexity bound obtained with AI assistance is both correct and had not been established before.

What would settle it

A published proof of the identical complexity bound that predates the AI-assisted derivation, or a concrete linear system on which the algorithm converges more slowly than the claimed bound.

read the original abstract

In 1937, Stefan Kaczmarz proposed a simple algorithm for solving systems of linear equations. This algorithm turned out to be the earliest known example of stochastic gradient descent, a ubiquitous computing paradigm that drives the training of modern AI models such as ChatGPT and Gemini. Now, those AI models have joined forces to discover the worst-case complexity of the Kaczmarz algorithm. This paper tells the story of how it happened.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a narrative about AI finding Kaczmarz complexity but gives neither the bound nor any checkable derivation or novelty argument.

read the letter

The main takeaway is that the paper frames itself as the story of AI models discovering the worst-case complexity of the Kaczmarz algorithm, which it positions as the oldest SGD method. It does not present the actual complexity result, a proof sketch, or a comparison against existing analyses that use the smallest singular value or row norms.

What stands out is the attempt to document an AI-assisted discovery process in optimization. The link between Kaczmarz and SGD is already in the literature the abstract cites, so that part adds no new technical content. The narrative angle might interest readers who track how large language models are applied to classical math questions.

The soft spots are central rather than minor. Without the bound itself or any derivation, there is no way to verify correctness, independence from prior results, or whether the claim of settling the question holds. The abstract supplies only the story, and the stress-test note correctly flags that the central assertion rests on an unreported process. This leaves the soundness and circularity axes untestable from the given material.

The paper is aimed at readers curious about AI in mathematical research rather than specialists seeking new theorems or reproducible derivations. It does not show the kind of formal grounding or external verifiability that would normally justify referee time. I would not bring it to a reading group focused on optimization or learning theory.

Recommendation: do not send to peer review in this form. It would need the actual complexity statement, verification steps, and comparison to existing bounds before it could be evaluated as a research contribution.

Referee Report

2 major / 0 minor

Summary. The manuscript narrates how AI models collaborated to discover the worst-case complexity of the Kaczmarz algorithm (the earliest known instance of SGD) and presents this as a story of AI-assisted discovery in optimization.

Significance. If the claimed discovery were both correct and previously unknown, it would be notable for resolving the complexity of a foundational 1937 algorithm using modern AI tools. However, the manuscript supplies neither the bound, a derivation, nor verification details, so the significance cannot be assessed from the provided text.

major comments (2)

[Abstract] Abstract: the central claim that AI models 'have joined forces to discover the worst-case complexity' is unsupported because the manuscript states neither the bound itself nor any derivation, proof sketch, or comparison to prior Kaczmarz analyses (e.g., rates involving the smallest singular value or normalized row norms).
The manuscript is framed as a process narrative, yet the load-bearing assertion of a new, correct, and novel complexity result requires at minimum the explicit bound and a verification argument; their absence renders the claim unevaluable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. The manuscript is structured as a narrative documenting an AI-assisted discovery process for the Kaczmarz algorithm's complexity. We agree that the central claim requires explicit technical support to be fully evaluable and will revise accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that AI models 'have joined forces to discover the worst-case complexity' is unsupported because the manuscript states neither the bound itself nor any derivation, proof sketch, or comparison to prior Kaczmarz analyses (e.g., rates involving the smallest singular value or normalized row norms).

Authors: We agree that the abstract's claim would benefit from explicit support. The manuscript prioritizes the discovery narrative over technical exposition, but this leaves the result unevaluable as noted. In revision we will add the explicit worst-case bound, a brief derivation sketch, and comparison to prior analyses (e.g., those based on singular values or row norms). revision: yes
Referee: [—] The manuscript is framed as a process narrative, yet the load-bearing assertion of a new, correct, and novel complexity result requires at minimum the explicit bound and a verification argument; their absence renders the claim unevaluable.

Authors: The observation is correct: a narrative framing alone cannot substantiate a novel complexity result. We will revise by incorporating the bound and a verification argument (e.g., in a new section or appendix) while preserving the process-oriented structure. revision: yes

Circularity Check

0 steps flagged

Narrative paper presents no derivation chain; no load-bearing steps to inspect

full rationale

The manuscript is a process narrative whose abstract and described content state that AI models discovered the Kaczmarz worst-case complexity but supply neither the bound, any equations, nor a derivation. With no claimed first-principles derivation or mathematical steps present, none of the enumerated circularity patterns (self-definitional, fitted-input prediction, self-citation load-bearing, etc.) can apply. The paper does not attempt to derive the result internally, so the analysis finds no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Insufficient information in the abstract to enumerate free parameters, axioms, or invented entities; the paper appears to be a narrative account rather than a technical derivation.

pith-pipeline@v0.9.1-grok · 5606 in / 962 out tokens · 20610 ms · 2026-06-30T07:17:01.862255+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 6 canonical work pages · 2 internal anchors

[1]

First proof.arXiv preprint arXiv:2602.05192, 2026

Mohammed Abouzaid, Andrew J Blumberg, Martin Hairer, Joe Kileel, Tamara G Kolda, Paul D Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, et al. First proof.arXiv preprint arXiv:2602.05192, 2026

work page arXiv 2026
[2]

McGraw-Hill New York, 1960

Lars Valerian Ahlfors.Complex analysis: an introduction to the theory of analytic functions of one complex variable. McGraw-Hill New York, 1960

1960
[3]

Forbiddensidonsubsetsofperfectdifferencesets, featuring a human-assisted proof.Proceedings of the National Academy of Sciences, 123(21):e2531760123, 2026

BorisAlexeevandDustinGMixon. Forbiddensidonsubsetsofperfectdifferencesets, featuring a human-assisted proof.Proceedings of the National Academy of Sciences, 123(21):e2531760123, 2026

2026
[4]

Remarks on the disproof of the unit distance conjecture

Noga Alon, Thomas F Bloom, WT Gowers, Daniel Litt, Will Sawin, Arul Shankar, Jacob Tsimerman, Victor Wang, and Melanie Matchett Wood. Remarks on the disproof of the unit distance conjecture.arXiv preprint arXiv:2605.20695, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

The importance of better models in stochastic opti- mization.Proceedings of the National Academy of Sciences, 116(46):22924–22930, 2019

Hilal Asi and John C Duchi. The importance of better models in stochastic opti- mization.Proceedings of the National Academy of Sciences, 116(46):22924–22930, 2019

2019
[6]

Fast last-iterate convergence of sgd in the smooth interpolation regime.Advances in Neural Infor- mation Processing Systems, 38:104951–104987, 2025

Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergence of sgd in the smooth interpolation regime.Advances in Neural Infor- mation Processing Systems, 38:104951–104987, 2025

2025
[7]

Non-strongly-convex smooth stochastic approxi- mation with convergence rate o (1/n).Advances in neural information processing systems, 26, 2013

Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approxi- mation with convergence rate o (1/n).Advances in neural information processing systems, 26, 2013

2013
[8]

Tight nonparametric conver- gence rates for stochastic gradient descent under the noiseless linear model.Advances in Neural Information Processing Systems, 33:2576–2586, 2020

Raphaël Berthier, Francis Bach, and Pierre Gaillard. Tight nonparametric conver- gence rates for stochastic gradient descent under the noiseless linear model.Advances in Neural Information Processing Systems, 33:2576–2586, 2020. 18

2020
[9]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open- source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Analyticity and discrete maximal regularity on lp-spaces.Journal of Functional Analysis, 183(1):211–230, 2001

Sönke Blunck. Analyticity and discrete maximal regularity on lp-spaces.Journal of Functional Analysis, 183(1):211–230, 2001

2001
[11]

Optimization methods for large- scale machine learning.SIAM review, 60(2):223–311, 2018

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large- scale machine learning.SIAM review, 60(2):223–311, 2018

2018
[12]

Language models are few-shot learners.Advances in neural information pro- cessing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra- fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information pro- cessing systems, 33:1877–1901, 2020

1901
[13]

Last-iterate convergence of randomized kacz- marz and sgd with greedy step size.Conference on Learning Theory, 2026

Michał Dereziński and Xiaoyu Dong. Last-iterate convergence of randomized kacz- marz and sgd with greedy step size.Conference on Learning Theory, 2026

2026
[14]

Random- ized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025

Michał Dereziński, Deanna Needell, Elizaveta Rebrova, and Jiaming Yang. Random- ized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025

2025
[15]

Introductory lectures on stochastic optimization.The mathematics of data, 25:99–186, 2018

John C Duchi. Introductory lectures on stochastic optimization.The mathematics of data, 25:99–186, 2018

2018
[16]

From continual learning to sgd and back: Better rates for continual linear models

Itay Evron, Ran Levinstein, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, and Nathan Srebro. From continual learning to sgd and back: Better rates for continual linear models. InFourth Conference on Lifelong Learning Agents- Workshop Track, 2025

2025
[17]

Handbook of convergence theorems for (stochastic) gradient methods

Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235, 2023

work page arXiv 2023
[18]

The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares.Advances in neural information processing systems, 32, 2019

Rong Ge, Sham M Kakade, Rahul Kidambi, and Praneeth Netrapalli. The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares.Advances in neural information processing systems, 32, 2019

2019
[19]

Randomized iterative methods for linear systems.SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015

Robert M Gower and Peter Richtárik. Randomized iterative methods for linear systems.SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015

2015
[20]

Tight analyses for non-smooth stochastic gradient descent

Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. InConference on Learning Theory, pages 1579–1613. PMLR, 2019

2019
[21]

Making the last iterate of sgd information theoretically optimal

Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. InConference on Learning Theory, pages 1752–1755. PMLR, 2019

2019
[22]

arXiv preprint arXiv:2510.23513 (2025)

Uijeong Jang and Ernest K Ryu. Point convergence of nesterov’s accelerated gradient method: An ai-assisted proof.arXiv preprint arXiv:2510.23513, 2025. 19

work page arXiv 2025
[23]

Société mathématique de France, 2006

Marius Junge, Christian Le Merdy, and Quanhua Xu.H∞ functional calculus and square functions on noncommutative Lp-spaces. Société mathématique de France, 2006

2006
[24]

M. S. Kaczmarz. Angenaherte auflosung von systemen linearer gleichungen.Bulletin International de l’Academie Polonaise des Sciences et des Lettres, 35:355–357, 1937

1937
[25]

Adam: A method for stochastic optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015

2015
[26]

Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems

Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In2013 ieee 54th annual symposium on foundations of computer science, pages 147–156. IEEE, 2013

2013
[27]

Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010

Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010

2010
[28]

Aiming towards the minimizers: fast convergence of sgd for overparametrized prob- lems.Advances in neural information processing systems, 36:60748–60767, 2023

Chaoyue Liu, Dmitriy Drusvyatskiy, Misha Belkin, Damek Davis, and Yian Ma. Aiming towards the minimizers: fast convergence of sgd for overparametrized prob- lems.Advances in neural information processing systems, 36:60748–60767, 2023

2023
[29]

An accelerated randomized kaczmarz algorithm.Math- ematics of Computation, 85(297):153–178, 2016

Ji Liu and Stephen Wright. An accelerated randomized kaczmarz algorithm.Math- ematics of Computation, 85(297):153–178, 2016

2016
[30]

Revisiting the last-iterate convergence of stochastic gradient methods.The Twelfth International Conference on Learning Representa- tions, 2023

Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods.The Twelfth International Conference on Learning Representa- tions, 2023

2023
[31]

Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm.Advances in neural information processing systems, 27, 2014

Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm.Advances in neural information processing systems, 27, 2014

2014
[32]

Paved with good intentions: analysis of a ran- domized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014

Deanna Needell and Joel A Tropp. Paved with good intentions: analysis of a ran- domized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014

2014
[33]

Planar point sets with many unit distances

OpenAI. Planar point sets with many unit distances. 2026. Manuscript avail- able at https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit- distance-proof.pdf

2026
[34]

A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

1951
[35]

SIAM, 2003

Yousef Saad.Iterative methods for sparse linear systems. SIAM, 2003

2003
[36]

Open problem: Is averaging needed for strongly convex stochastic gradient descent? InConference on Learning Theory, pages 47–1

Ohad Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent? InConference on Learning Theory, pages 47–1. JMLR Workshop and Conference Proceedings, 2012

2012
[37]

Stochastic gradient descent for non-smooth op- timization: Convergence results and optimal averaging schemes

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth op- timization: Convergence results and optimal averaging schemes. InInternational conference on machine learning, pages 71–79. PMLR, 2013. 20

2013
[38]

OUP Oxford, 1999

Kangshen Shen, John N Crossley, and Anthony Wah-Cheung Lun.The nine chapters on the mathematical art: Companion and commentary. OUP Oxford, 1999

1999
[39]

Approximate solutions of linear systems at a universal rate

Stefan Steinerberger. Approximate solutions of linear systems at a universal rate. SIAM Journal on Matrix Analysis and Applications, 44(3):1436–1446, 2023

2023
[40]

A randomized kaczmarz algorithm with exponential convergence.Journal of Fourier Analysis and Applications, 15(2):262– 278, 2009

Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with exponential convergence.Journal of Fourier Analysis and Applications, 15(2):262– 278, 2009

2009
[41]

Last iterate convergence of sgd for least-squares in the interpolation regime.Advances in Neural Information Processing Systems, 34:21581–21591, 2021

Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Last iterate convergence of sgd for least-squares in the interpolation regime.Advances in Neural Information Processing Systems, 34:21581–21591, 2021

2021
[42]

Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron

Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. InThe 22nd international conference on artificial intelligence and statistics, pages 1195–1204. PMLR, 2019

2019
[43]

Acceler- ating scientific research with gemini: Case studies and common techniques.arXiv preprint arXiv:2602.03837, 2026

David P Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, Mo- hammadHossein Bateni, Simina Branzei, Michael P Brenner, Lin Chen, Ying Feng, et al. Accelerating scientific research with gemini: Case studies and common tech- niques.arXiv preprint arXiv:2602.03837, 2026

work page arXiv 2026
[44]

Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression

Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, and Sham Kakade. Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression. InInternational conference on machine learning, pages 24280–24314. PMLR, 2022

2022
[45]

Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025

Moslem Zamani and Francois Glineur. Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025

2025
[46]

Randomized extended kaczmarz for solv- ingleastsquares.SIAM Journal on Matrix Analysis and Applications, 34(2):773–793, 2013

Anastasios Zouzias and Nikolaos M Freris. Randomized extended kaczmarz for solv- ingleastsquares.SIAM Journal on Matrix Analysis and Applications, 34(2):773–793, 2013. 21

2013

[1] [1]

First proof.arXiv preprint arXiv:2602.05192, 2026

Mohammed Abouzaid, Andrew J Blumberg, Martin Hairer, Joe Kileel, Tamara G Kolda, Paul D Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, et al. First proof.arXiv preprint arXiv:2602.05192, 2026

work page arXiv 2026

[2] [2]

McGraw-Hill New York, 1960

Lars Valerian Ahlfors.Complex analysis: an introduction to the theory of analytic functions of one complex variable. McGraw-Hill New York, 1960

1960

[3] [3]

Forbiddensidonsubsetsofperfectdifferencesets, featuring a human-assisted proof.Proceedings of the National Academy of Sciences, 123(21):e2531760123, 2026

BorisAlexeevandDustinGMixon. Forbiddensidonsubsetsofperfectdifferencesets, featuring a human-assisted proof.Proceedings of the National Academy of Sciences, 123(21):e2531760123, 2026

2026

[4] [4]

Remarks on the disproof of the unit distance conjecture

Noga Alon, Thomas F Bloom, WT Gowers, Daniel Litt, Will Sawin, Arul Shankar, Jacob Tsimerman, Victor Wang, and Melanie Matchett Wood. Remarks on the disproof of the unit distance conjecture.arXiv preprint arXiv:2605.20695, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

The importance of better models in stochastic opti- mization.Proceedings of the National Academy of Sciences, 116(46):22924–22930, 2019

Hilal Asi and John C Duchi. The importance of better models in stochastic opti- mization.Proceedings of the National Academy of Sciences, 116(46):22924–22930, 2019

2019

[6] [6]

Fast last-iterate convergence of sgd in the smooth interpolation regime.Advances in Neural Infor- mation Processing Systems, 38:104951–104987, 2025

Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergence of sgd in the smooth interpolation regime.Advances in Neural Infor- mation Processing Systems, 38:104951–104987, 2025

2025

[7] [7]

Non-strongly-convex smooth stochastic approxi- mation with convergence rate o (1/n).Advances in neural information processing systems, 26, 2013

Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approxi- mation with convergence rate o (1/n).Advances in neural information processing systems, 26, 2013

2013

[8] [8]

Tight nonparametric conver- gence rates for stochastic gradient descent under the noiseless linear model.Advances in Neural Information Processing Systems, 33:2576–2586, 2020

Raphaël Berthier, Francis Bach, and Pierre Gaillard. Tight nonparametric conver- gence rates for stochastic gradient descent under the noiseless linear model.Advances in Neural Information Processing Systems, 33:2576–2586, 2020. 18

2020

[9] [9]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open- source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Analyticity and discrete maximal regularity on lp-spaces.Journal of Functional Analysis, 183(1):211–230, 2001

Sönke Blunck. Analyticity and discrete maximal regularity on lp-spaces.Journal of Functional Analysis, 183(1):211–230, 2001

2001

[11] [11]

Optimization methods for large- scale machine learning.SIAM review, 60(2):223–311, 2018

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large- scale machine learning.SIAM review, 60(2):223–311, 2018

2018

[12] [12]

Language models are few-shot learners.Advances in neural information pro- cessing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra- fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information pro- cessing systems, 33:1877–1901, 2020

1901

[13] [13]

Last-iterate convergence of randomized kacz- marz and sgd with greedy step size.Conference on Learning Theory, 2026

Michał Dereziński and Xiaoyu Dong. Last-iterate convergence of randomized kacz- marz and sgd with greedy step size.Conference on Learning Theory, 2026

2026

[14] [14]

Random- ized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025

Michał Dereziński, Deanna Needell, Elizaveta Rebrova, and Jiaming Yang. Random- ized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025

2025

[15] [15]

Introductory lectures on stochastic optimization.The mathematics of data, 25:99–186, 2018

John C Duchi. Introductory lectures on stochastic optimization.The mathematics of data, 25:99–186, 2018

2018

[16] [16]

From continual learning to sgd and back: Better rates for continual linear models

Itay Evron, Ran Levinstein, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, and Nathan Srebro. From continual learning to sgd and back: Better rates for continual linear models. InFourth Conference on Lifelong Learning Agents- Workshop Track, 2025

2025

[17] [17]

Handbook of convergence theorems for (stochastic) gradient methods

Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235, 2023

work page arXiv 2023

[18] [18]

The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares.Advances in neural information processing systems, 32, 2019

Rong Ge, Sham M Kakade, Rahul Kidambi, and Praneeth Netrapalli. The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares.Advances in neural information processing systems, 32, 2019

2019

[19] [19]

Randomized iterative methods for linear systems.SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015

Robert M Gower and Peter Richtárik. Randomized iterative methods for linear systems.SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015

2015

[20] [20]

Tight analyses for non-smooth stochastic gradient descent

Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. InConference on Learning Theory, pages 1579–1613. PMLR, 2019

2019

[21] [21]

Making the last iterate of sgd information theoretically optimal

Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. InConference on Learning Theory, pages 1752–1755. PMLR, 2019

2019

[22] [22]

arXiv preprint arXiv:2510.23513 (2025)

Uijeong Jang and Ernest K Ryu. Point convergence of nesterov’s accelerated gradient method: An ai-assisted proof.arXiv preprint arXiv:2510.23513, 2025. 19

work page arXiv 2025

[23] [23]

Société mathématique de France, 2006

Marius Junge, Christian Le Merdy, and Quanhua Xu.H∞ functional calculus and square functions on noncommutative Lp-spaces. Société mathématique de France, 2006

2006

[24] [24]

M. S. Kaczmarz. Angenaherte auflosung von systemen linearer gleichungen.Bulletin International de l’Academie Polonaise des Sciences et des Lettres, 35:355–357, 1937

1937

[25] [25]

Adam: A method for stochastic optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015

2015

[26] [26]

Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems

Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In2013 ieee 54th annual symposium on foundations of computer science, pages 147–156. IEEE, 2013

2013

[27] [27]

Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010

Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010

2010

[28] [28]

Aiming towards the minimizers: fast convergence of sgd for overparametrized prob- lems.Advances in neural information processing systems, 36:60748–60767, 2023

Chaoyue Liu, Dmitriy Drusvyatskiy, Misha Belkin, Damek Davis, and Yian Ma. Aiming towards the minimizers: fast convergence of sgd for overparametrized prob- lems.Advances in neural information processing systems, 36:60748–60767, 2023

2023

[29] [29]

An accelerated randomized kaczmarz algorithm.Math- ematics of Computation, 85(297):153–178, 2016

Ji Liu and Stephen Wright. An accelerated randomized kaczmarz algorithm.Math- ematics of Computation, 85(297):153–178, 2016

2016

[30] [30]

Revisiting the last-iterate convergence of stochastic gradient methods.The Twelfth International Conference on Learning Representa- tions, 2023

Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods.The Twelfth International Conference on Learning Representa- tions, 2023

2023

[31] [31]

Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm.Advances in neural information processing systems, 27, 2014

Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm.Advances in neural information processing systems, 27, 2014

2014

[32] [32]

Paved with good intentions: analysis of a ran- domized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014

Deanna Needell and Joel A Tropp. Paved with good intentions: analysis of a ran- domized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014

2014

[33] [33]

Planar point sets with many unit distances

OpenAI. Planar point sets with many unit distances. 2026. Manuscript avail- able at https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit- distance-proof.pdf

2026

[34] [34]

A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

1951

[35] [35]

SIAM, 2003

Yousef Saad.Iterative methods for sparse linear systems. SIAM, 2003

2003

[36] [36]

Open problem: Is averaging needed for strongly convex stochastic gradient descent? InConference on Learning Theory, pages 47–1

Ohad Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent? InConference on Learning Theory, pages 47–1. JMLR Workshop and Conference Proceedings, 2012

2012

[37] [37]

Stochastic gradient descent for non-smooth op- timization: Convergence results and optimal averaging schemes

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth op- timization: Convergence results and optimal averaging schemes. InInternational conference on machine learning, pages 71–79. PMLR, 2013. 20

2013

[38] [38]

OUP Oxford, 1999

Kangshen Shen, John N Crossley, and Anthony Wah-Cheung Lun.The nine chapters on the mathematical art: Companion and commentary. OUP Oxford, 1999

1999

[39] [39]

Approximate solutions of linear systems at a universal rate

Stefan Steinerberger. Approximate solutions of linear systems at a universal rate. SIAM Journal on Matrix Analysis and Applications, 44(3):1436–1446, 2023

2023

[40] [40]

A randomized kaczmarz algorithm with exponential convergence.Journal of Fourier Analysis and Applications, 15(2):262– 278, 2009

Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with exponential convergence.Journal of Fourier Analysis and Applications, 15(2):262– 278, 2009

2009

[41] [41]

Last iterate convergence of sgd for least-squares in the interpolation regime.Advances in Neural Information Processing Systems, 34:21581–21591, 2021

Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Last iterate convergence of sgd for least-squares in the interpolation regime.Advances in Neural Information Processing Systems, 34:21581–21591, 2021

2021

[42] [42]

Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron

Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. InThe 22nd international conference on artificial intelligence and statistics, pages 1195–1204. PMLR, 2019

2019

[43] [43]

Acceler- ating scientific research with gemini: Case studies and common techniques.arXiv preprint arXiv:2602.03837, 2026

David P Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, Mo- hammadHossein Bateni, Simina Branzei, Michael P Brenner, Lin Chen, Ying Feng, et al. Accelerating scientific research with gemini: Case studies and common tech- niques.arXiv preprint arXiv:2602.03837, 2026

work page arXiv 2026

[44] [44]

Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression

Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, and Sham Kakade. Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression. InInternational conference on machine learning, pages 24280–24314. PMLR, 2022

2022

[45] [45]

Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025

Moslem Zamani and Francois Glineur. Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025

2025

[46] [46]

Randomized extended kaczmarz for solv- ingleastsquares.SIAM Journal on Matrix Analysis and Applications, 34(2):773–793, 2013

Anastasios Zouzias and Nikolaos M Freris. Randomized extended kaczmarz for solv- ingleastsquares.SIAM Journal on Matrix Analysis and Applications, 34(2):773–793, 2013. 21

2013