pith. sign in

arxiv: 2606.29593 · v1 · pith:ADHTHG7Hnew · submitted 2026-06-28 · 💻 cs.LG · cs.AI· cs.NA· math.NA· math.OC· stat.ML

How AI settled the complexity of the oldest SGD algorithm

Pith reviewed 2026-06-30 07:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.NAmath.NAmath.OCstat.ML
keywords Kaczmarz algorithmstochastic gradient descentworst-case complexitylinear systemsconvergence analysisiterative methodsAI-assisted discovery
0
0 comments X

The pith

AI models have discovered the worst-case complexity of the Kaczmarz algorithm, the earliest form of stochastic gradient descent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper recounts how artificial intelligence systems collaborated to resolve the long-standing question of the exact worst-case complexity of the Kaczmarz algorithm first proposed in 1937. This method is recognized as the original stochastic gradient descent procedure and forms a foundational component of the optimization techniques used to train contemporary large language models. A sympathetic reader would care because the result supplies the tightest known performance guarantees for this basic iterative solver of linear systems and illustrates how AI can contribute to settling open theoretical questions in optimization. The narrative frames the discovery as a joint effort among models that produced the complexity bound without prior human derivation.

Core claim

The paper states that AI models working together have determined the precise worst-case complexity of the Kaczmarz algorithm, an iterative procedure for solving linear equations that selects equations at random or in sequence, thereby establishing its fundamental convergence rate that had remained unresolved since its introduction.

What carries the argument

The Kaczmarz algorithm itself, an iterative row-selection method for linear systems that serves as the earliest instance of stochastic gradient descent.

If this is right

  • The result supplies tight convergence guarantees for the foundational SGD variant used in linear solvers.
  • It positions the Kaczmarz method as the historical root whose complexity now serves as a reference point for later SGD variants.
  • The discovery process demonstrates that AI collaboration can resolve open questions in the analysis of iterative algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same AI-assisted approach could be applied to derive tight bounds for other randomized iterative methods whose complexity remains open.
  • It raises the possibility that theoretical computer science may increasingly rely on automated assistance for identifying extremal examples and proving bounds.
  • Connections appear between this linear-algebra setting and the analysis of convergence rates in modern non-convex optimization.

Load-bearing premise

The assumption that the complexity bound obtained with AI assistance is both correct and had not been established before.

What would settle it

A published proof of the identical complexity bound that predates the AI-assisted derivation, or a concrete linear system on which the algorithm converges more slowly than the claimed bound.

read the original abstract

In 1937, Stefan Kaczmarz proposed a simple algorithm for solving systems of linear equations. This algorithm turned out to be the earliest known example of stochastic gradient descent, a ubiquitous computing paradigm that drives the training of modern AI models such as ChatGPT and Gemini. Now, those AI models have joined forces to discover the worst-case complexity of the Kaczmarz algorithm. This paper tells the story of how it happened.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript narrates how AI models collaborated to discover the worst-case complexity of the Kaczmarz algorithm (the earliest known instance of SGD) and presents this as a story of AI-assisted discovery in optimization.

Significance. If the claimed discovery were both correct and previously unknown, it would be notable for resolving the complexity of a foundational 1937 algorithm using modern AI tools. However, the manuscript supplies neither the bound, a derivation, nor verification details, so the significance cannot be assessed from the provided text.

major comments (2)
  1. [Abstract] Abstract: the central claim that AI models 'have joined forces to discover the worst-case complexity' is unsupported because the manuscript states neither the bound itself nor any derivation, proof sketch, or comparison to prior Kaczmarz analyses (e.g., rates involving the smallest singular value or normalized row norms).
  2. The manuscript is framed as a process narrative, yet the load-bearing assertion of a new, correct, and novel complexity result requires at minimum the explicit bound and a verification argument; their absence renders the claim unevaluable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. The manuscript is structured as a narrative documenting an AI-assisted discovery process for the Kaczmarz algorithm's complexity. We agree that the central claim requires explicit technical support to be fully evaluable and will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that AI models 'have joined forces to discover the worst-case complexity' is unsupported because the manuscript states neither the bound itself nor any derivation, proof sketch, or comparison to prior Kaczmarz analyses (e.g., rates involving the smallest singular value or normalized row norms).

    Authors: We agree that the abstract's claim would benefit from explicit support. The manuscript prioritizes the discovery narrative over technical exposition, but this leaves the result unevaluable as noted. In revision we will add the explicit worst-case bound, a brief derivation sketch, and comparison to prior analyses (e.g., those based on singular values or row norms). revision: yes

  2. Referee: [—] The manuscript is framed as a process narrative, yet the load-bearing assertion of a new, correct, and novel complexity result requires at minimum the explicit bound and a verification argument; their absence renders the claim unevaluable.

    Authors: The observation is correct: a narrative framing alone cannot substantiate a novel complexity result. We will revise by incorporating the bound and a verification argument (e.g., in a new section or appendix) while preserving the process-oriented structure. revision: yes

Circularity Check

0 steps flagged

Narrative paper presents no derivation chain; no load-bearing steps to inspect

full rationale

The manuscript is a process narrative whose abstract and described content state that AI models discovered the Kaczmarz worst-case complexity but supply neither the bound, any equations, nor a derivation. With no claimed first-principles derivation or mathematical steps present, none of the enumerated circularity patterns (self-definitional, fitted-input prediction, self-citation load-bearing, etc.) can apply. The paper does not attempt to derive the result internally, so the analysis finds no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Insufficient information in the abstract to enumerate free parameters, axioms, or invented entities; the paper appears to be a narrative account rather than a technical derivation.

pith-pipeline@v0.9.1-grok · 5606 in / 962 out tokens · 20610 ms · 2026-06-30T07:17:01.862255+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1]

    First proof.arXiv preprint arXiv:2602.05192, 2026

    Mohammed Abouzaid, Andrew J Blumberg, Martin Hairer, Joe Kileel, Tamara G Kolda, Paul D Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, et al. First proof.arXiv preprint arXiv:2602.05192, 2026

  2. [2]

    McGraw-Hill New York, 1960

    Lars Valerian Ahlfors.Complex analysis: an introduction to the theory of analytic functions of one complex variable. McGraw-Hill New York, 1960

  3. [3]

    Forbiddensidonsubsetsofperfectdifferencesets, featuring a human-assisted proof.Proceedings of the National Academy of Sciences, 123(21):e2531760123, 2026

    BorisAlexeevandDustinGMixon. Forbiddensidonsubsetsofperfectdifferencesets, featuring a human-assisted proof.Proceedings of the National Academy of Sciences, 123(21):e2531760123, 2026

  4. [4]

    Remarks on the disproof of the unit distance conjecture

    Noga Alon, Thomas F Bloom, WT Gowers, Daniel Litt, Will Sawin, Arul Shankar, Jacob Tsimerman, Victor Wang, and Melanie Matchett Wood. Remarks on the disproof of the unit distance conjecture.arXiv preprint arXiv:2605.20695, 2026

  5. [5]

    The importance of better models in stochastic opti- mization.Proceedings of the National Academy of Sciences, 116(46):22924–22930, 2019

    Hilal Asi and John C Duchi. The importance of better models in stochastic opti- mization.Proceedings of the National Academy of Sciences, 116(46):22924–22930, 2019

  6. [6]

    Fast last-iterate convergence of sgd in the smooth interpolation regime.Advances in Neural Infor- mation Processing Systems, 38:104951–104987, 2025

    Amit Attia, Matan Schliserman, Uri Sherman, and Tomer Koren. Fast last-iterate convergence of sgd in the smooth interpolation regime.Advances in Neural Infor- mation Processing Systems, 38:104951–104987, 2025

  7. [7]

    Non-strongly-convex smooth stochastic approxi- mation with convergence rate o (1/n).Advances in neural information processing systems, 26, 2013

    Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approxi- mation with convergence rate o (1/n).Advances in neural information processing systems, 26, 2013

  8. [8]

    Tight nonparametric conver- gence rates for stochastic gradient descent under the noiseless linear model.Advances in Neural Information Processing Systems, 33:2576–2586, 2020

    Raphaël Berthier, Francis Bach, and Pierre Gaillard. Tight nonparametric conver- gence rates for stochastic gradient descent under the noiseless linear model.Advances in Neural Information Processing Systems, 33:2576–2586, 2020. 18

  9. [9]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open- source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024

  10. [10]

    Analyticity and discrete maximal regularity on lp-spaces.Journal of Functional Analysis, 183(1):211–230, 2001

    Sönke Blunck. Analyticity and discrete maximal regularity on lp-spaces.Journal of Functional Analysis, 183(1):211–230, 2001

  11. [11]

    Optimization methods for large- scale machine learning.SIAM review, 60(2):223–311, 2018

    Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large- scale machine learning.SIAM review, 60(2):223–311, 2018

  12. [12]

    Language models are few-shot learners.Advances in neural information pro- cessing systems, 33:1877–1901, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra- fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information pro- cessing systems, 33:1877–1901, 2020

  13. [13]

    Last-iterate convergence of randomized kacz- marz and sgd with greedy step size.Conference on Learning Theory, 2026

    Michał Dereziński and Xiaoyu Dong. Last-iterate convergence of randomized kacz- marz and sgd with greedy step size.Conference on Learning Theory, 2026

  14. [14]

    Random- ized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025

    Michał Dereziński, Deanna Needell, Elizaveta Rebrova, and Jiaming Yang. Random- ized kaczmarz methods with beyond-krylov convergence.SIAM Journal on Matrix Analysis and Applications, 46(4):2558–2588, 2025

  15. [15]

    Introductory lectures on stochastic optimization.The mathematics of data, 25:99–186, 2018

    John C Duchi. Introductory lectures on stochastic optimization.The mathematics of data, 25:99–186, 2018

  16. [16]

    From continual learning to sgd and back: Better rates for continual linear models

    Itay Evron, Ran Levinstein, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, and Nathan Srebro. From continual learning to sgd and back: Better rates for continual linear models. InFourth Conference on Lifelong Learning Agents- Workshop Track, 2025

  17. [17]

    Handbook of convergence theorems for (stochastic) gradient methods

    Guillaume Garrigos and Robert M Gower. Handbook of convergence theorems for (stochastic) gradient methods.arXiv preprint arXiv:2301.11235, 2023

  18. [18]

    The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares.Advances in neural information processing systems, 32, 2019

    Rong Ge, Sham M Kakade, Rahul Kidambi, and Praneeth Netrapalli. The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares.Advances in neural information processing systems, 32, 2019

  19. [19]

    Randomized iterative methods for linear systems.SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015

    Robert M Gower and Peter Richtárik. Randomized iterative methods for linear systems.SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015

  20. [20]

    Tight analyses for non-smooth stochastic gradient descent

    Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. InConference on Learning Theory, pages 1579–1613. PMLR, 2019

  21. [21]

    Making the last iterate of sgd information theoretically optimal

    Prateek Jain, Dheeraj Nagaraj, and Praneeth Netrapalli. Making the last iterate of sgd information theoretically optimal. InConference on Learning Theory, pages 1752–1755. PMLR, 2019

  22. [22]

    arXiv preprint arXiv:2510.23513 (2025)

    Uijeong Jang and Ernest K Ryu. Point convergence of nesterov’s accelerated gradient method: An ai-assisted proof.arXiv preprint arXiv:2510.23513, 2025. 19

  23. [23]

    Société mathématique de France, 2006

    Marius Junge, Christian Le Merdy, and Quanhua Xu.H∞ functional calculus and square functions on noncommutative Lp-spaces. Société mathématique de France, 2006

  24. [24]

    M. S. Kaczmarz. Angenaherte auflosung von systemen linearer gleichungen.Bulletin International de l’Academie Polonaise des Sciences et des Lettres, 35:355–357, 1937

  25. [25]

    Adam: A method for stochastic optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015

  26. [26]

    Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems

    Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In2013 ieee 54th annual symposium on foundations of computer science, pages 147–156. IEEE, 2013

  27. [27]

    Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010

    Dennis Leventhal and Adrian S Lewis. Randomized methods for linear constraints: convergence rates and conditioning.Mathematics of Operations Research, 35(3):641– 654, 2010

  28. [28]

    Aiming towards the minimizers: fast convergence of sgd for overparametrized prob- lems.Advances in neural information processing systems, 36:60748–60767, 2023

    Chaoyue Liu, Dmitriy Drusvyatskiy, Misha Belkin, Damek Davis, and Yian Ma. Aiming towards the minimizers: fast convergence of sgd for overparametrized prob- lems.Advances in neural information processing systems, 36:60748–60767, 2023

  29. [29]

    An accelerated randomized kaczmarz algorithm.Math- ematics of Computation, 85(297):153–178, 2016

    Ji Liu and Stephen Wright. An accelerated randomized kaczmarz algorithm.Math- ematics of Computation, 85(297):153–178, 2016

  30. [30]

    Revisiting the last-iterate convergence of stochastic gradient methods.The Twelfth International Conference on Learning Representa- tions, 2023

    Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods.The Twelfth International Conference on Learning Representa- tions, 2023

  31. [31]

    Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm.Advances in neural information processing systems, 27, 2014

    Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm.Advances in neural information processing systems, 27, 2014

  32. [32]

    Paved with good intentions: analysis of a ran- domized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014

    Deanna Needell and Joel A Tropp. Paved with good intentions: analysis of a ran- domized block kaczmarz method.Linear Algebra and its Applications, 441:199–221, 2014

  33. [33]

    Planar point sets with many unit distances

    OpenAI. Planar point sets with many unit distances. 2026. Manuscript avail- able at https://cdn.openai.com/pdf/74c24085-19b0-4534-9c90-465b8e29ad73/unit- distance-proof.pdf

  34. [34]

    A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

    Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

  35. [35]

    SIAM, 2003

    Yousef Saad.Iterative methods for sparse linear systems. SIAM, 2003

  36. [36]

    Open problem: Is averaging needed for strongly convex stochastic gradient descent? InConference on Learning Theory, pages 47–1

    Ohad Shamir. Open problem: Is averaging needed for strongly convex stochastic gradient descent? InConference on Learning Theory, pages 47–1. JMLR Workshop and Conference Proceedings, 2012

  37. [37]

    Stochastic gradient descent for non-smooth op- timization: Convergence results and optimal averaging schemes

    Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth op- timization: Convergence results and optimal averaging schemes. InInternational conference on machine learning, pages 71–79. PMLR, 2013. 20

  38. [38]

    OUP Oxford, 1999

    Kangshen Shen, John N Crossley, and Anthony Wah-Cheung Lun.The nine chapters on the mathematical art: Companion and commentary. OUP Oxford, 1999

  39. [39]

    Approximate solutions of linear systems at a universal rate

    Stefan Steinerberger. Approximate solutions of linear systems at a universal rate. SIAM Journal on Matrix Analysis and Applications, 44(3):1436–1446, 2023

  40. [40]

    A randomized kaczmarz algorithm with exponential convergence.Journal of Fourier Analysis and Applications, 15(2):262– 278, 2009

    Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with exponential convergence.Journal of Fourier Analysis and Applications, 15(2):262– 278, 2009

  41. [41]

    Last iterate convergence of sgd for least-squares in the interpolation regime.Advances in Neural Information Processing Systems, 34:21581–21591, 2021

    Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Last iterate convergence of sgd for least-squares in the interpolation regime.Advances in Neural Information Processing Systems, 34:21581–21591, 2021

  42. [42]

    Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron

    Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. InThe 22nd international conference on artificial intelligence and statistics, pages 1195–1204. PMLR, 2019

  43. [43]

    Acceler- ating scientific research with gemini: Case studies and common techniques.arXiv preprint arXiv:2602.03837, 2026

    David P Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, Mo- hammadHossein Bateni, Simina Branzei, Michael P Brenner, Lin Chen, Ying Feng, et al. Accelerating scientific research with gemini: Case studies and common tech- niques.arXiv preprint arXiv:2602.03837, 2026

  44. [44]

    Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression

    Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, and Sham Kakade. Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression. InInternational conference on machine learning, pages 24280–24314. PMLR, 2022

  45. [45]

    Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025

    Moslem Zamani and Francois Glineur. Exact convergence rate of the last iterate in subgradient methods.SIAM Journal on Optimization, 35(3):2182–2201, 2025

  46. [46]

    Randomized extended kaczmarz for solv- ingleastsquares.SIAM Journal on Matrix Analysis and Applications, 34(2):773–793, 2013

    Anastasios Zouzias and Nikolaos M Freris. Randomized extended kaczmarz for solv- ingleastsquares.SIAM Journal on Matrix Analysis and Applications, 34(2):773–793, 2013. 21