Pith · machine review for the scientific record

arxiv: 2605.03373 · v1 · submitted 2026-05-05 · 💻 cs.LG

Recognition: unknown

Learning Dynamics of Zeroth-Order Optimization: A Kernel Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 17:26 UTC · model grok-4.3

classification 💻 cs.LG
keywords zeroth-order optimization · neural tangent kernel · learning dynamics · Johnson-Lindenstrauss lemma · large language models · fine-tuning · stochastic gradient descent · dimension-free approximation

The pith

Zeroth-order SGD produces an empirical neural tangent kernel whose approximation error depends on output dimension and perturbation count rather than parameter count.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Classical theory expects zeroth-order optimization to degrade sharply with growing model size since it relies on finite differences instead of true gradients. Recent practice shows these methods can nevertheless fine-tune language models with billions of parameters. The paper derives the exact one-step update for ZO SGD and observes that the empirical neural tangent kernel appears as the governing term. Each entry of this kernel equals the inner product of two neural tangent vectors after they are projected onto a random subspace whose dimension equals the number of perturbations. The Johnson-Lindenstrauss lemma then bounds the distortion of these inner products, revealing that the error grows with output size but stays independent of the original parameter dimension.
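
For readers who want the estimator concretely, here is a minimal NumPy sketch of a generic two-point zeroth-order gradient estimator of the kind this derivation starts from. The quadratic toy loss, the perturbation scale `mu`, and all function names are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def zo_gradient(loss, theta, P=64, mu=1e-3):
    """Two-point ZO gradient estimate averaged over P Gaussian perturbations.
    A generic SPSA-style estimator; the paper's exact variant may differ."""
    g = np.zeros_like(theta)
    for _ in range(P):
        u = rng.standard_normal(theta.size)
        g += (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu) * u
    return g / P

# Toy quadratic loss whose true gradient is A @ theta.
d = 50
A = np.diag(np.linspace(1.0, 2.0, d))
loss = lambda th: 0.5 * th @ A @ th
theta = rng.standard_normal(d)

g_hat, g_true = zo_gradient(loss, theta, P=512), A @ theta
cos = g_hat @ g_true / (np.linalg.norm(g_hat) * np.linalg.norm(g_true))
print(f"cosine similarity to true gradient: {cos:.3f}")
```

Each perturbation contributes a directional derivative times its direction, so the estimate is (for small `mu`) an unbiased projection of the true gradient onto the random subspace spanned by the perturbations; this is the projection that later reappears inside the kernel.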

Core claim

We derive the one-step learning dynamics of ZO SGD, where the empirical Neural Tangent Kernel emerges naturally as the key term. Inspection shows that each element of the ZO eNTK is the inner product of neural tangent vectors projected onto a random low-dimensional subspace. Invoking the Johnson-Lindenstrauss Lemma establishes that the fidelity of this approximation is governed primarily by the number of perturbations, with the error depending on model output size rather than parameter dimension. This dimension-free property supplies a theoretical account for the observed success of ZO methods on LLM fine-tuning.
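
Assembling the fragments visible in the source (the ZO-SGD update, the expansion of log π_t, and the bound ∥ΔK∥_F ≤ ε·Ξ_t·√V with ε ∼ √((log V)/P)), the one-step dynamics can be sketched as follows. This is a hedged reconstruction in the abstract's notation, not a verbatim copy of the paper's equations.

```latex
% Hedged reconstruction from the abstract and the equation fragments in the
% figure captions; notation (pi_t, u_p, P, V, Xi_t) follows those fragments.
\begin{align*}
  % ZO-SGD step built from P two-point finite differences
  \theta_{t+1} &= \theta_t - \frac{\eta}{P}\sum_{p=1}^{P}
      \frac{\mathcal{L}(\theta_t + \mu u_p) - \mathcal{L}(\theta_t - \mu u_p)}{2\mu}\, u_p, \\
  % one-step change of the model belief at a test point x_o
  \log \pi_{t+1}(y \mid x_o) - \log \pi_t(y \mid x_o)
      &\approx \big\langle \nabla_\theta \log \pi_t(y \mid x_o),\, \theta_{t+1} - \theta_t \big\rangle \\
      &\approx -\eta\, \nabla_\theta \log \pi_t(y \mid x_o)^{\top}
        \Big(\tfrac{1}{P}\textstyle\sum_{p} u_p u_p^{\top}\Big)\,
        \nabla_\theta \mathcal{L}\big(f_\theta(x_u), y_u\big).
\end{align*}
% The middle factor is the random projector: it turns the full eNTK
% K(x_u, x_o) into the projected kernel K(x_u, x_o; U_t, P), with
% ||Delta K||_F <= eps * Xi_t * sqrt(V) and eps ~ sqrt((log V) / P).
```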

What carries the argument

The zeroth-order empirical neural tangent kernel (ZO eNTK), formed from inner products of neural tangent vectors after random projection onto the perturbation directions; this kernel directly controls the one-step parameter update in ZO SGD.
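
A minimal numerical illustration of this construction, with random stand-in Jacobians rather than real network tangent vectors: inserting the empirical projector (1/P)·UᵀU between the Jacobians yields the ZO eNTK, and the relative error shrinks with P at fixed output dimension.

```python
import numpy as np

rng = np.random.default_rng(1)
d, V = 10_000, 8   # parameter dimension, output dimension (illustrative)

# Stand-in tangent vectors: rows of the Jacobians at inputs x_u and x_o.
# x_o is made similar to x_u, mirroring the "high similarity" digit pairs
# in the paper's figures; real eNTK entries would use network Jacobians.
J_u = rng.standard_normal((V, d)) / np.sqrt(d)
J_o = J_u + 0.3 * rng.standard_normal((V, d)) / np.sqrt(d)

K_full = J_u @ J_o.T   # full eNTK block (V x V)

def zo_entk(P):
    """ZO eNTK block: tangent vectors projected onto P random directions,
    i.e. the projector (1/P) U^T U inserted between the Jacobians."""
    U = rng.standard_normal((P, d))      # perturbation directions as rows
    return (J_u @ U.T) @ (U @ J_o.T) / P

for P in (16, 128, 1024):
    err = np.linalg.norm(zo_entk(P) - K_full) / np.linalg.norm(K_full)
    print(f"P={P:5d}  relative Frobenius error = {err:.3f}")
```

Rerunning with a larger d while holding P fixed leaves the error essentially unchanged, which is the dimension-free property the paper attributes to the JL lemma.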

If this is right

  • ZO optimization can scale to models with arbitrarily many parameters without incurring the classical dimension-dependent slowdown.
  • The accuracy of the kernel approximation improves directly with the number of perturbations, independent of the parameter dimension.
  • The error bound depends on output dimension, so ZO methods remain practical even when the parameter space is enormous.
  • The same kernel perspective can be used to analyze other zeroth-order variants or to compare them with first-order methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the one-step picture extends to many steps, ZO fine-tuning should produce similar feature evolution to first-order training in wide networks.
  • Perturbation distributions other than the standard Gaussian could be tuned to reduce the number of samples needed while preserving the same JL guarantee.
  • Direct numerical checks of the projected tangent vectors on moderate-sized models would provide an immediate empirical test of the derived bound (a toy version of such a check, covering both Gaussian and Rademacher directions, follows this list).
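
A toy version of that check, under the assumption that the tangent vectors can be treated as fixed high-dimensional vectors: it measures the JL distortion of a single inner product for Gaussian versus Rademacher perturbation directions, both of which are sub-Gaussian and so carry the same JL guarantee.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10_000   # stand-in for a large parameter dimension

# Two fixed vectors standing in for neural tangent vectors at x_u and x_o.
a = rng.standard_normal(d)
b = rng.standard_normal(d)
true_ip = a @ b
scale = np.linalg.norm(a) * np.linalg.norm(b)

def projected_ip(P, dist):
    """Inner product after JL-style projection onto P random directions."""
    if dist == "gaussian":
        U = rng.standard_normal((P, d))
    else:   # Rademacher (+/-1) entries: also sub-Gaussian, same JL bound
        U = rng.choice([-1.0, 1.0], size=(P, d))
    return (U @ a) @ (U @ b) / P

for P in (64, 256, 1024):
    for dist in ("gaussian", "rademacher"):
        err = abs(projected_ip(P, dist) - true_ip) / scale
        print(f"P={P:5d}  {dist:10s}  normalized error = {err:.4f}")
```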

Load-bearing premise

The analysis assumes the one-step dynamics capture the dominant learning behavior and that the network is wide enough for the empirical neural tangent kernel to remain a faithful descriptor throughout training.

What would settle it

Compute the full empirical NTK on a small wide network, run ZO SGD with increasing numbers of perturbations, and measure how closely the observed loss trajectory matches the kernel-regression prediction; the gap should shrink with more perturbations while remaining insensitive to further increases in parameter count.
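
A compressed form of this experiment, simplified to a linear model so that the kernel-regression prediction is exact by construction (a deliberate stand-in for the "small wide network"); all sizes and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, T = 20, 2_000, 100
eta, mu = 0.5, 1e-4
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)

def loss(theta):
    return 0.5 * np.mean((X @ theta - y) ** 2)

# Kernel-regression reference trajectory: for a linear model the eNTK
# K = X X^T is constant, and full-gradient descent satisfies
# f_{t+1} = f_t - (eta / n) K (f_t - y) in function space.
K = X @ X.T
f = np.zeros(n)
kernel_losses = []
for _ in range(T):
    kernel_losses.append(0.5 * np.mean((f - y) ** 2))
    f = f - (eta / n) * (K @ (f - y))

def zo_losses(P):
    """Loss trajectory of two-point ZO-SGD with P perturbations per step."""
    theta = np.zeros(d)
    out = []
    for _ in range(T):
        out.append(loss(theta))
        g = np.zeros(d)
        for _ in range(P):
            u = rng.standard_normal(d)
            g += (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu) * u
        theta = theta - eta * g / P
    return np.array(out)

for P in (8, 64, 256):
    gap = np.max(np.abs(zo_losses(P) - np.array(kernel_losses)))
    print(f"P={P:4d}  max |ZO loss - kernel prediction| = {gap:.4f}")
```

If the paper's account is right, the printed gap should shrink as P grows, and repeating the run at a larger d should leave it roughly unchanged.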

Figures

Figures reproduced from arXiv: 2605.03373 by Bicheng Ying, Haibo Yang, Zhe Li, Zidong Liu.

Figure 1. ZO eNTK vs. FO eNTK. The pair with high similarity: 4 and 9. The pair with low similarity: 1 and 0. Relative Frobenius norm error ∥K(x_u, x_o) − K(x_u, x_o; U_t, P)∥_F / ∥K(x_u, x_o)∥_F, plotted against the number of perturbations P (1 to 1000) for the pairs (x_u, x_o) ∈ {(0, 1), (4, 9), (1, 7), (0, 5)}.

Figure 2. Convergence of the Frobenius norm error between the ZO and FO eNTK. Pairs with high similarity: 4 and 9, 1 and 7. Pairs with low similarity: 0 and 1, 0 and 5.

Figure 3. Calculating and interpolating one element of the projected empirical neural tangent kernel. The ZO approximation converges to the ground truth; at P = 125 the ZO eNTK becomes visually almost identical to the FO eNTK, with a minimal Frobenius norm difference of ≈ 0.338.

Figure 4. Gaussian vs. Rademacher distributions (LeNet + MNIST).

Figure 5. ZO trajectory comparison from OPT-125M to OPT-1.3B on the SST-2 task over different numbers of perturbations.

Figure 6. Logit trajectories on different x_o over 200 iterations (difference vs. number of perturbations P for OPT-125M, OPT-350M, and OPT-1.3B; Rademacher and Gaussian perturbations).

Figure 7. ℓ2 norm difference between ZO and FO model belief.

Figure 8. ZO eNTK vs. FO eNTK under different test samples x_o and a fixed x_u = 1 (LeNet, MNIST).

Figure 9. x_u = 1 (LeNet, MNIST).

Figure 10. ZO eNTK vs. FO eNTK under different test samples x_o and a fixed x_u = 2 (LeNet, MNIST).

Figure 11. x_u = 2 (LeNet, MNIST).

Figure 12. ZO eNTK vs. FO eNTK under different test samples x_o and a fixed x_u = 4 (LeNet, MNIST).

Figure 13. x_u = 4 (LeNet, MNIST).
read the original abstract

Classical optimization theory establishes that zeroth-order (ZO) algorithms suffer from a dimension-dependent slowdown, with convergence rates typically scaling with the model dimension compared to first-order methods. However, in contrast to these theoretical expectations, a growing body of recent work demonstrates the successful application of ZO methods to fine-tuning Large Language Models (LLMs) with billions of parameters. To explain this paradox, we derive the one-step learning dynamics of ZO SGD, where the empirical Neural Tangent Kernel (eNTK) naturally emerges as the key term governing the learning behavior. Inspection of the eNTK produced by ZO SGD reveals that each element corresponds to the inner product of neural tangent vectors projected onto a random low-dimensional subspace. Thus, by invoking the Johnson-Lindenstrauss Lemma, our analysis shows that the fidelity of the ZO eNTK is governed primarily by the number of perturbations. Crucially, the approximation error depends on the model output size rather than the massive parameter dimension. This dimension-free property provides a theoretical justification for the scalability of ZO methods to LLMs finetuning tasks. We believe that this kernel-based framework offers a novel perspective for understanding ZO methods within the context of learning dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper derives the one-step learning dynamics of zeroth-order SGD, in which the empirical Neural Tangent Kernel (eNTK) appears as the governing operator. It observes that each entry of the ZO eNTK is an inner product of tangent vectors projected onto the random subspace spanned by the perturbation directions, and invokes the Johnson-Lindenstrauss lemma to bound the deviation from the full eNTK. The resulting error bound depends on the number of perturbations and the output dimension but is independent of the parameter dimension, thereby supplying a theoretical account for the observed scalability of ZO methods to LLM fine-tuning.

Significance. If the derivation is correct, the work supplies a kernel-theoretic explanation for the practical success of zeroth-order optimization in regimes where classical dimension-dependent rates would predict failure. The dimension-free character of the JL-based bound, obtained from standard one-step linearization and the distributional properties of the perturbations, is a clear strength and offers a concrete link between ZO methods and the NTK literature.

minor comments (3)
  1. The manuscript should explicitly enumerate the assumptions required for the one-step expansion to capture the dominant dynamics (e.g., sufficiently wide networks or specific initialization regimes) and for the JL lemma to apply without additional data-dependent restrictions.
  2. A brief discussion of the higher-order remainder terms omitted by the one-step analysis, together with a statement of the regime in which they remain negligible, would strengthen the claim that the derived dynamics are representative.
  3. Notation for the random perturbation vectors and the precise definition of the ZO eNTK (including how the low-dimensional projections are formed) should be introduced with a dedicated display equation early in the derivation section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. The summary accurately captures the core technical contribution.

read point-by-point responses
  1. Referee: The paper derives the one-step learning dynamics of zeroth-order SGD, in which the empirical Neural Tangent Kernel (eNTK) appears as the governing operator. It observes that each entry of the ZO eNTK is an inner product of tangent vectors projected onto the random subspace spanned by the perturbation directions, and invokes the Johnson-Lindenstrauss lemma to bound the deviation from the full eNTK. The resulting error bound depends on the number of perturbations and the output dimension but is independent of the parameter dimension, thereby supplying a theoretical account for the observed scalability of ZO methods to LLM fine-tuning.

    Authors: We appreciate the referee's concise and accurate summary of the derivation and its implications. The one-step analysis begins from the ZO gradient estimator and directly yields the projected eNTK as the effective operator; the JL bound then follows from the sub-Gaussian concentration of the random projections, with the error scaling in output dimension rather than parameter dimension as stated.

    revision: no

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper constructs the ZO eNTK directly from the model outputs and random perturbation directions in the one-step ZO SGD dynamics, then applies the external Johnson-Lindenstrauss Lemma to bound the approximation error of the projected tangent vectors. The error bound depends on the number of perturbations and output dimension (not parameter dimension d), which follows from the standard JL concentration result applied to the fixed collection of tangent vectors. No equation reduces to a fitted parameter renamed as a prediction, no self-citation chain justifies a uniqueness claim, and the central dimension-free property is not smuggled via ansatz or renaming. The derivation remains self-contained against the external lemma and standard NTK linearization.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard Johnson-Lindenstrauss Lemma applied to random projections of neural tangent vectors; no free parameters are introduced and no new entities are postulated.

axioms (1)
  • standard math: the Johnson-Lindenstrauss Lemma applies to the random low-dimensional projections of the tangent vectors.
    Invoked in the abstract to bound the fidelity of the ZO eNTK approximation; a standard statement is sketched below.
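
For reference, a standard form of the lemma (textbook constants, not taken from the paper):

```latex
% Distributional Johnson-Lindenstrauss lemma, standard statement.
For any $0 < \epsilon < 1$ and any $m$ vectors $v_1, \dots, v_m \in \mathbb{R}^d$,
a random matrix $U \in \mathbb{R}^{P \times d}$ with i.i.d.\ sub-Gaussian entries
and $P = O\!\left(\epsilon^{-2} \log m\right)$ satisfies, with high probability,
\[
  (1-\epsilon)\,\lVert v_i - v_j \rVert_2^2
  \;\le\; \bigl\lVert \tfrac{1}{\sqrt{P}}\, U (v_i - v_j) \bigr\rVert_2^2
  \;\le\; (1+\epsilon)\,\lVert v_i - v_j \rVert_2^2
  \quad \text{for all } i, j.
\]
% Inner products, and hence the eNTK entries, are preserved to the same
% order via the polarization identity.
```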

pith-pipeline@v0.9.0 · 5518 in / 1365 out tokens · 87637 ms · 2026-05-07T17:26:21.843357+00:00 · methodology

discussion (0)

