pith. sign in

arxiv: 2605.00740 · v1 · submitted 2026-05-01 · 🧮 math.OC · cs.LG· stat.ML

Randomized Subspace Nesterov Accelerated Gradient

Pith reviewed 2026-05-09 18:38 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML
keywords randomized subspace optimizationNesterov accelerationmatrix smoothnessoracle complexityconvex optimizationsketching methodsaccelerated gradient descent
0
0 comments X

The pith

Randomized-subspace Nesterov accelerated gradient methods attain accelerated oracle complexity for smooth convex optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops randomized-subspace versions of Nesterov accelerated gradient methods for smooth convex and strongly convex optimization problems. These methods use only low-dimensional projected gradient information from random sketches, which reduces computational cost in high dimensions. A key three-sequence formulation is introduced that is tailored to matrix smoothness and recovers the standard Nesterov method when the sketch is the full space. The analysis delivers accelerated convergence rates in terms of oracle complexity, with explicit dependence on the matrix smoothness constant and the moment properties of the sketch distribution. This framework allows systematic comparison of different random sketch families and identifies settings where the subspace approach has better oracle complexity than full-dimensional acceleration.

Core claim

We develop randomized-subspace Nesterov accelerated gradient methods for smooth convex and smooth strongly convex optimization under matrix smoothness and generic sketch moment assumptions. The key technical ingredient is a three-sequence formulation tailored to matrix smoothness, which recovers the corresponding classical Nesterov methods in the full-dimensional case. The resulting theory establishes accelerated oracle-complexity guarantees and makes explicit how matrix smoothness and the sketch distribution enter the complexity. It also provides a unified basis for comparing sketch families and identifying when randomized-subspace acceleration improves over full-dimensional Nesterov in the

What carries the argument

A three-sequence formulation tailored to matrix smoothness that incorporates projected-gradient steps from random subspace sketches.

Load-bearing premise

The objective function satisfies matrix smoothness and the random sketches obey generic moment conditions that control the projection errors.

What would settle it

Numerical computation of the actual oracle complexity for a concrete smooth convex function and a specific sketch distribution that violates the predicted accelerated rate under the matrix smoothness assumption.

Figures

Figures reproduced from arXiv: 2605.00740 by Akiko Takeda, Gaku Omiya, Pierre-Louis Poirion.

Figure 1
Figure 1. Figure 1: Oracle-axis convergence on the four quadratic problems. The horizontal axis shows the number of view at source ↗
Figure 2
Figure 2. Figure 2: Oracle-axis comparison for ℓ2-regularized logistic regression on six real-world datasets. The hori￾zontal axis shows oracle calls, and the vertical axis shows f(xk) − fref on a logarithmic scale, where fref is computed by L-BFGS-B. We compare GD, NAG-SC, RS-GD, and RS-NAG-SC with Haar, coordinate, and Gaussian sketches. Each curve is the mean over 3 random seeds, and the shaded region shows one standard de… view at source ↗
Figure 3
Figure 3. Figure 3: Oracle-axis convergence in the sketch-dimension scan. The horizontal axis shows the cumulative view at source ↗
Figure 4
Figure 4. Figure 4: Oracle-axis comparison for ℓ2-regularized logistic regression on six real-world datasets. The hori￾zontal axis shows oracle calls, and the vertical axis shows f(xk) − fref on a logarithmic scale, where fref is a reference value computed by L-BFGS-B. We compare GD, NAG-SC, RS-GD, and RS-NAG-SC with Haar, coordinate, and Gaussian sketches. Each plotted curve is the mean over 10 random seeds, and the shaded r… view at source ↗
read the original abstract

Randomized-subspace methods reduce the cost of first-order optimization by using only low-dimensional projected-gradient information, a feature that is attractive in forward-mode automatic differentiation and communication-limited settings. While Nesterov acceleration is well understood for full-gradient and coordinate-based methods, obtaining accelerated methods for general subspace sketches that use only projected-gradient information and can improve over full-dimensional Nesterov acceleration in oracle complexity is technically nontrivial. We develop randomized-subspace Nesterov accelerated gradient methods for smooth convex and smooth strongly convex optimization under matrix smoothness and generic sketch moment assumptions. The key technical ingredient is a three-sequence formulation tailored to matrix smoothness, which recovers the corresponding classical Nesterov methods in the full-dimensional case. The resulting theory establishes accelerated oracle-complexity guarantees and makes explicit how matrix smoothness and the sketch distribution enter the complexity. It also provides a unified basis for comparing sketch families and identifying when randomized-subspace acceleration improves over full-dimensional Nesterov acceleration in oracle complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops randomized-subspace Nesterov accelerated gradient methods for smooth convex and smooth strongly convex optimization. It relies on matrix smoothness and generic sketch moment assumptions, introducing a three-sequence formulation that recovers classical Nesterov acceleration when the sketch is full-dimensional. The resulting theory provides accelerated oracle-complexity bounds that explicitly incorporate the matrix smoothness constant and properties of the sketch distribution, offering a unified way to compare sketch families and identify regimes where subspace acceleration yields better oracle complexity than standard full-dimensional Nesterov.

Significance. If the central derivations hold, the work fills a gap in accelerated first-order methods for projected-gradient settings relevant to automatic differentiation and distributed optimization. The explicit dependence of the bounds on sketch moments is a clear strength, enabling theoretical guidance on sketch selection. Recovery of the classical case as a special instance strengthens internal consistency. The contribution would be more impactful if the analysis demonstrates concrete improvement over full-dimensional Nesterov for some sketch families, rather than only formal acceleration under generic assumptions.

major comments (2)
  1. [§4 (oracle complexity theorem)] The main complexity result (likely the theorem in §4 deriving the oracle bounds): the generic sketch moment assumptions are used to obtain accelerated rates, but the manuscript does not explicitly derive or exhibit a condition on the sketch (e.g., a bound on the second-moment operator or variance that reduces the effective smoothness factor below the full-dimensional case after projection). Without this, the claim that the theory allows 'identifying when' randomized-subspace acceleration improves over full-dimensional Nesterov remains formal rather than operational.
  2. [§3 (three-sequence formulation)] §3 (three-sequence formulation): the construction is tailored to matrix smoothness and stated to recover classical Nesterov in the full-dimensional limit, but the reduction step must be verified to ensure no extraneous dimension-dependent inflation appears in the leading constant when the sketch becomes the identity; otherwise the comparison to standard Nesterov is not direct.
minor comments (2)
  1. [Abstract] The abstract and introduction could include the precise form of the accelerated rate (e.g., dependence on sketch size or moment parameters) to make the improvement claim more concrete at first reading.
  2. [Notation and preliminaries] Notation for the sketch distribution and its moment assumptions should be introduced once and used uniformly; occasional redefinition of symbols for the projected gradient or momentum terms reduces readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, constructive feedback, and recognition of the paper's contributions to randomized-subspace acceleration. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4 (oracle complexity theorem)] The main complexity result (likely the theorem in §4 deriving the oracle bounds): the generic sketch moment assumptions are used to obtain accelerated rates, but the manuscript does not explicitly derive or exhibit a condition on the sketch (e.g., a bound on the second-moment operator or variance that reduces the effective smoothness factor below the full-dimensional case after projection). Without this, the claim that the theory allows 'identifying when' randomized-subspace acceleration improves over full-dimensional Nesterov remains formal rather than operational.

    Authors: We agree that while the explicit dependence of the oracle bounds on sketch moments and the matrix smoothness constant permits comparison in principle, the manuscript stops short of deriving a concrete condition or example for a specific sketch family that reduces the effective smoothness below the full-dimensional case. In the revision we will add a dedicated remark (or short subsection) that derives such a condition for representative sketches (e.g., random coordinate and Gaussian sketches) and identifies the parameter regimes where the projected smoothness constant is strictly smaller, thereby making the identification operational. revision: yes

  2. Referee: [§3 (three-sequence formulation)] §3 (three-sequence formulation): the construction is tailored to matrix smoothness and stated to recover classical Nesterov in the full-dimensional limit, but the reduction step must be verified to ensure no extraneous dimension-dependent inflation appears in the leading constant when the sketch becomes the identity; otherwise the comparison to standard Nesterov is not direct.

    Authors: We thank the referee for highlighting the need for an explicit verification. In the revised version we will insert a short lemma (or corollary) in §3 that directly substitutes the identity sketch into the three-sequence recurrence and the complexity bound, confirming that the leading constants match those of classical Nesterov acceleration exactly, with no additional dimension-dependent factors introduced by the formulation. revision: yes

Circularity Check

0 steps flagged

Derivation proceeds from stated assumptions without reduction to inputs by construction

full rationale

The paper starts from matrix smoothness and generic sketch moment assumptions, introduces a three-sequence formulation that is explicitly constructed to recover classical Nesterov when the sketch is full-dimensional, and then derives oracle-complexity bounds that express the dependence on the smoothness constant and sketch moments. These bounds are obtained by standard Lyapunov-style analysis rather than by fitting parameters to the target rate or by invoking a self-citation whose content is the result itself. No step equates a derived quantity to a fitted input or renames an empirical pattern; the analysis remains self-contained against the external benchmarks of classical Nesterov acceleration.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions standard in optimization theory; no free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption Matrix smoothness assumption
    Invoked to enable the three-sequence formulation and to express complexity in terms of the smoothness matrix.
  • domain assumption Generic sketch moment assumptions
    Required to obtain the oracle complexity guarantees for randomized subspace methods.

pith-pipeline@v0.9.0 · 5469 in / 1263 out tokens · 31502 ms · 2026-05-09T18:38:55.857548+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

  1. [1]

    Even faster accelerated coordinate descent using non-uniform sampling

    Zeyuan Allen-Zhu, Zheng Qu, Peter Richt´ arik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. InProceedings of the 33rd International Conference on Machine Learning, volume 48 ofProceedings of Machine Learning Research, pages 1110–1119, 2016

  2. [2]

    Notterman, Kurt Gish, Suzanne Ybarra, Daniel Mack, and Arnold J

    Uri Alon, Naama Barkai, Daniel A. Notterman, Kurt Gish, Suzanne Ybarra, Daniel Mack, and Arnold J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.Proceedings of the National Academy of Sciences, 96(12): 6745–6750, 1999

  3. [3]

    arXiv preprint arXiv:2202.08587 , year=

    Atılım G¨ une¸ s Baydin, Barak A. Pearlmutter, Don Syme, Frank Wood, and Philip Torr. Gradients without backpropagation, 2022. arXiv:2202.08587

  4. [4]

    Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu

    Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization.SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995

  5. [5]

    Scalable subspace methods for derivative-free nonlinear least-squares optimization.Mathematical Programming, 199(1–2):461–524, 2023

    Coralia Cartis and Lindon Roberts. Scalable subspace methods for derivative-free nonlinear least-squares optimization.Mathematical Programming, 199(1–2):461–524, 2023

  6. [6]

    LIBSVM: A library for support vector machines.ACM Trans- actions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011

    Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.ACM Trans- actions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011

  7. [7]

    Accelerated, parallel, and proximal coordinate descent.SIAM Journal on Optimization, 25(4):1997–2023, 2015

    Olivier Fercoq and Peter Richt´ arik. Accelerated, parallel, and proximal coordinate descent.SIAM Journal on Optimization, 25(4):1997–2023, 2015

  8. [8]

    Problem-dependent convergence bounds for ran- domized linear gradient compression, 2024

    Thomas Flynn, Patrick Johnstone, and Shinjae Yoo. Problem-dependent convergence bounds for ran- domized linear gradient compression, 2024. arXiv:2411.12898. 10

  9. [9]

    Golub, Donna K

    Todd R. Golub, Donna K. Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P. Mesirov, Hilary Coller, Mignon L. Loh, James R. Downing, Mark A. Caligiuri, Clara D. Bloomfield, and Eric S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring.Science, 286(5439):531–537, 1999

  10. [10]

    Result analysis of the NIPS 2003 feature selection challenge

    Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. InAdvances in Neural Information Processing Systems, volume 17, pages 545–552, 2005

  11. [11]

    Isabelle Guyon, Amir Saffari, Gideon Dror, and Gavin C. Cawley. Agnostic learning vs. prior knowledge challenge. InProceedings of the International Joint Conference on Neural Networks, pages 829–834, 2007

  12. [12]

    Predicting a biological response

    Ben Hamner, dcthompson, and Jorg. Predicting a biological response. Kaggle competition, 2012. URL https://kaggle.com/competitions/bioresponse

  13. [13]

    Accelerated coordinate descent with arbitrary sampling and best rates for minibatches

    Filip Hanzely and Peter Richt´ arik. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. InProceedings of the 22nd International Conference on Artificial Intelligence and Statistics, volume 89 ofProceedings of Machine Learning Research, pages 304–312, 2019

  14. [14]

    Warren Hare, Lindon Roberts, and Cl´ ement W. Royer. Expected decrease for derivative-free algorithms using random subspaces.Mathematics of Computation, 94(351):277–304, 2025

  15. [15]

    Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks

    Mingyi Hong, Davood Hajinezhad, and Ming-Min Zhao. Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. InProceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 1529–1538, 2017

  16. [16]

    The UCI machine learning repository.https: //archive.ics.uci.edu, 2023

    Markelle Kelly, Rachel Longjohn, and Kolby Nottingham. The UCI machine learning repository.https: //archive.ics.uci.edu, 2023

  17. [17]

    A stochastic subspace approach to gradient-free optimization in high dimensions.Computational Optimization and Applications, 79(2): 339–368, 2021

    David Kozak, Stephen Becker, Alireza Doostan, and Luis Tenorio. A stochastic subspace approach to gradient-free optimization in high dimensions.Computational Optimization and Applications, 79(2): 339–368, 2021

  18. [18]

    Zeroth-order optimiza- tion with orthogonal random directions.Mathematical Programming, 199(1–2):1179–1219, 2023

    David Kozak, Cesare Molinari, Lorenzo Rosasco, Luis Tenorio, and Silvia Villa. Zeroth-order optimiza- tion with orthogonal random directions.Mathematical Programming, 199(1–2):1179–1219, 2023

  19. [19]

    Det-CGD: Compressed gradient descent with matrix stepsizes for non-convex optimization

    Hanmin Li, Avetik Karagulyan, and Peter Richt´ arik. Det-CGD: Compressed gradient descent with matrix stepsizes for non-convex optimization. InProceedings of the 12th International Conference on Learning Representations, 2024

  20. [20]

    Acceleration for compressed gradient descent in distributed and federated optimization

    Zhize Li, Dmitry Kovalev, Xun Qian, and Peter Richt´ arik. Acceleration for compressed gradient descent in distributed and federated optimization. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 5895–5904, 2020

  21. [21]

    An accelerated proximal coordinate gradient method

    Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, volume 27, pages 3059–3067, 2014

  22. [22]

    On the complexity analysis of randomized block-coordinate descent methods

    Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1–2):615–642, 2015

  23. [23]

    Gradskip: Communication-accelerated local gradient methods with better computational complexity.Transactions on Machine Learning Research, 2025

    Artavazd Maranjyan, Mher Safaryan, and Peter Richt´ arik. Gradskip: Communication-accelerated local gradient methods with better computational complexity.Transactions on Machine Learning Research, 2025

  24. [24]

    A method of solving a convex programming problem with convergence rateO(1/k 2)

    Yurii Nesterov. A method of solving a convex programming problem with convergence rateO(1/k 2). Soviet Mathematics Doklady, 27(2):372–376, 1983. 11

  25. [25]

    Springer, 2004

    Yurii Nesterov.Introductory Lectures on Convex Optimization: A Basic Course, volume 87 ofApplied Optimization. Springer, 2004

  26. [26]

    Efficiency of coordinate descent methods on huge-scale optimization problems.SIAM Journal on Optimization, 22(2):341–362, 2012

    Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems.SIAM Journal on Optimization, 22(2):341–362, 2012

  27. [27]

    Random gradient-free minimization of convex functions.Foun- dations of Computational Mathematics, 17(2):527–566, 2017

    Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions.Foun- dations of Computational Mathematics, 17(2):527–566, 2017

  28. [28]

    Randomized subspace gradient method for constrained optimization, 2023

    Ryota Nozawa, Pierre-Louis Poirion, and Akiko Takeda. Randomized subspace gradient method for constrained optimization, 2023. arXiv:2307.03335

  29. [29]

    Convergence analysis of randomized subspace normalized SGD under heavy-tailed noise, 2026

    Gaku Omiya, Pierre-Louis Poirion, and Akiko Takeda. Convergence analysis of randomized subspace normalized SGD under heavy-tailed noise, 2026. arXiv:2601.20399

  30. [30]

    John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Sch¨ olkopf, Christopher J. C. Burges, and Alexander J. Smola, editors,Advances in Kernel Methods: Support Vector Learning, pages 185–208. MIT Press, Cambridge, MA, 1999

  31. [31]

    Ijcnn 2001 neural network competition

    Danil Prokhorov. Ijcnn 2001 neural network competition. Slide presentation in IJCNN’01, Ford Research Laboratory, 2001

  32. [32]

    Coordinate descent with arbitrary sampling I: Algorithms and complex- ity.Optimization Methods and Software, 31(5):829–857, 2016

    Zheng Qu and Peter Richt´ arik. Coordinate descent with arbitrary sampling I: Algorithms and complex- ity.Optimization Methods and Software, 31(5):829–857, 2016

  33. [33]

    Iteration complexity of randomized block-coordinate descent meth- ods for minimizing a composite function.Mathematical Programming, 144(1–2):1–38, 2014

    Peter Richt´ arik and Martin Tak´ aˇ c. Iteration complexity of randomized block-coordinate descent meth- ods for minimizing a composite function.Mathematical Programming, 144(1–2):1–38, 2014

  34. [34]

    Lindon Roberts and Cl´ ement W. Royer. Direct search based on probabilistic descent in reduced spaces. SIAM Journal on Optimization, 33(4):3057–3082, 2023

  35. [35]

    Smoothness matrices beat smoothness constants: Better communication compression techniques for distributed optimization

    Mher Safaryan, Filip Hanzely, and Peter Richt´ arik. Smoothness matrices beat smoothness constants: Better communication compression techniques for distributed optimization. InAdvances in Neural Information Processing Systems, volume 34, pages 25688–25702, 2021

  36. [36]

    van Rijn, Bernd Bischl, and Luis Torgo

    Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning.ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2013

  37. [37]

    Theoretically better and numerically faster dis- tributed optimization with smoothness-aware quantization techniques

    Bokun Wang, Mher Safaryan, and Peter Richt´ arik. Theoretically better and numerically faster dis- tributed optimization with smoothness-aware quantization techniques. InAdvances in Neural Informa- tion Processing Systems, volume 35, pages 9841–9852, 2022

  38. [38]

    Olson, Jeffrey R

    Mike West, Carrie Blanchette, Holly Dressman, Erich Huang, Seiichi Ishida, Rainer Spang, Harry Zuzan, John A. Olson, Jeffrey R. Marks, and Joseph R. Nevins. Predicting the clinical status of human breast cancer by using gene expression profiles.Proceedings of the National Academy of Sciences of the United States of America, 98(20):11462–11467, 2001

  39. [39]

    Stephen J. Wright. Coordinate descent algorithms.Mathematical Programming, 151(1):3–34, 2015

  40. [40]

    Johansson

    Tao Yang, Xinlei Yi, Junfeng Wu, Ye Yuan, Di Wu, Ziyang Meng, Yiguang Hong, Hong Wang, Zongli Lin, and Karl H. Johansson. A survey of distributed optimization.Annual Reviews in Control, 47: 278–305, 2019

  41. [41]

    Byrd, Peihuang Lu, and Jorge Nocedal

    Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran sub- routines for large-scale bound-constrained optimization.ACM Transactions on Mathematical Software, 23(4):550–560, 1997. 12 A Related work A.1 Other random-subspace derivative-free and zeroth-order methods Random-subspace ideas have also been studied in relate...

  42. [42]

    RS-GD (Kozak et al.)

    with dataset IDs 1039 and 46912, respectively; their OpenML license fields are listed asPublicand Public Domain, respectively. The remaining datasets were obtained from the LIBSVM binary-classification dataset page [6], which provides LIBSVM-formatted versions of datasets from existing public benchmark collections and lists source and preprocessing inform...