pith · machine review for the scientific record

arxiv: 2605.09034 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: no theorem link

Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration

Jiahe Chen, Ziye Ma

Pith reviewed 2026-05-12 02:14 UTC · model grok-4.3

classification 💻 cs.LG
keywords zeroth-order optimization · spectral optimization · power iteration · orthogonalization · LLM fine-tuning · variance reduction · Muon · subspace projection

The pith

Partial orthogonalization from power iteration speeds up zeroth-order spectral optimization for LLM fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Zeroth-order optimization lets models adapt to local data without back-propagation, which matters for memory-limited edge devices. Full orthogonalization, which helps spectral methods like Muon in the first-order case, turns counterproductive once gradients are replaced by noisy finite-difference estimates. The paper replaces the Newton-Schulz step with a power-iteration procedure that amplifies only the strongest directions and pairs it with a streaming momentum-subspace projection to keep variance low. This combination produces 1.5x to 4x faster convergence than the prior best zeroth-order Muon variant on SuperGLUE tasks with the OPT-13B model while remaining competitive in final accuracy against MeZO, LOZO, and other baselines.
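For readers unfamiliar with the estimator behind that noise, here is a minimal sketch of the two-point (SPSA/MeZO-style) zeroth-order gradient estimate that stands in for back-propagation. The toy quadratic loss, perturbation scale, and matrix shapes are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def zo_gradient_estimate(loss_fn, W, eps=1e-3, rng=None):
    """Two-point (SPSA-style) zeroth-order gradient estimate for a weight matrix W.

    Only two forward evaluations of loss_fn are needed, so no back-propagation
    graph is stored; the price is a rank-one, high-variance estimate of the gradient.
    """
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.standard_normal(W.shape)                    # random perturbation direction
    delta = loss_fn(W + eps * Z) - loss_fn(W - eps * Z)
    return (delta / (2.0 * eps)) * Z                    # directional derivative times Z

# Toy usage: quadratic loss around a random target matrix.
rng = np.random.default_rng(0)
target = rng.standard_normal((8, 4))
loss = lambda W: 0.5 * np.sum((W - target) ** 2)
g_hat = zo_gradient_estimate(loss, np.zeros((8, 4)), rng=rng)
print(g_hat.shape)  # (8, 4): a single noisy, rank-one gradient estimate
```

Because each estimate is a random rank-one matrix, its singular directions are dominated by the perturbation rather than by the true gradient, which is the source of the trouble with full orthogonalization.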

Core claim

Full orthogonalization of zeroth-order gradient estimates harms performance because the estimates are too noisy; replacing it with partial orthogonalization obtained from a streaming power-iteration method applied inside a momentum-projected subspace recovers the spectral advantage and accelerates convergence.

What carries the argument

Partial orthogonalization, which uses power iteration instead of Newton-Schulz to strengthen only dominant spectral directions while restricting updates to a low-variance momentum subspace.
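To make partial versus full orthogonalization concrete, the sketch below contrasts the two on a noisy estimate: full orthogonalization flattens every singular value to one (the effect Newton-Schulz approximates), while a power-iteration variant orthogonalizes only within the top-k right-singular subspace. The explicit SVD, the rank k, and the iteration count are simplifications for illustration, not the paper's implementation.

```python
import numpy as np

def full_orthogonalize(G):
    """Set every singular value of G to 1 (the map Newton-Schulz iterates toward)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def partial_orthogonalize(G, k=4, iters=10, rng=None):
    """Orthogonalize G only within its top-k spectral directions via block power iteration."""
    rng = np.random.default_rng() if rng is None else rng
    Q = rng.standard_normal((G.shape[1], k))
    for _ in range(iters):
        Q, _ = np.linalg.qr(G.T @ (G @ Q))      # power step on G^T G, then re-orthonormalize
    U, _, Vt = np.linalg.svd(G @ Q, full_matrices=False)
    return U @ Vt @ Q.T                          # rank-k orthogonalized update

rng = np.random.default_rng(1)
signal = np.outer(rng.standard_normal(64), rng.standard_normal(32))   # one strong direction
G_noisy = signal + 0.5 * rng.standard_normal((64, 32))                # plus heavy noise
print(np.linalg.matrix_rank(full_orthogonalize(G_noisy)))     # 32: every noise direction kept at unit strength
print(np.linalg.matrix_rank(partial_orthogonalize(G_noisy)))  # 4: only the dominant directions survive
```

Full orthogonalization promotes every noise direction to the same strength as the signal; the partial variant leaves them out, which is the intuition the core claim rests on.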

Load-bearing premise

That noisy zeroth-order gradient estimates make complete orthogonalization harmful while power iteration focused on dominant directions remains beneficial.

What would settle it

An experiment that applies full Newton-Schulz orthogonalization to the same zeroth-order gradient estimates and still records equal or faster convergence would remove the stated reason for switching to power iteration.
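On a toy problem, that settling experiment could be wired up as below. This sketch assumes the `zo_gradient_estimate`, `full_orthogonalize`, and `partial_orthogonalize` helpers from the earlier sketches are in scope, and the learning rate, step count, and low-rank target are arbitrary choices; whatever the toy outcome, it only shows the shape of the controlled comparison, not what would happen in the LLM setting.

```python
import numpy as np

def run(orthogonalize, steps=200, lr=0.05, seed=2):
    """Descend a toy quadratic using ZO estimates passed through a given orthogonalization."""
    rng = np.random.default_rng(seed)
    target = np.outer(rng.standard_normal(64), rng.standard_normal(32))   # low-rank optimum
    loss = lambda W: 0.5 * np.mean((W - target) ** 2)
    W = np.zeros_like(target)
    for _ in range(steps):
        g_hat = zo_gradient_estimate(loss, W, rng=rng)   # identical perturbation sequence in both arms
        W = W - lr * orthogonalize(g_hat)
    return loss(W)

print("full Newton-Schulz-style :", run(full_orthogonalize))
print("partial power-iteration  :", run(partial_orthogonalize))
```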

Figures

Figures reproduced from arXiv: 2605.09034 by Jiahe Chen and Ziye Ma.

Figure 1: Comparison of FO and ZO gradient spectra.
Figure 2: Gemma2-2B fine-tuning comparison; panel (a) reports test accuracy versus wall-clock time.
Figure 3: ZO-MOPI's training loss with and without momentum. The surrounding text gives the momentum coordinate transformation $M^{\mathrm{new}}_{t-1} \leftarrow (A^{\mathrm{new}})^{\top} A^{\mathrm{old}} M^{\mathrm{old}}_{t-1}$ (Eq. 11), where $M^{\mathrm{old}}_{t-1}, M^{\mathrm{new}}_{t-1} \in \mathbb{R}^{r \times n}$ are the momentum representations in the old and new coordinate systems, and notes that momentum matters beyond variance reduction because the effectiveness of SPI depends on it (a small sketch of this reprojection follows the figure list).
Figure 4: Lazy sampling interval choices. The surrounding text explains that the subspace A is held fixed and resampled every ν iterations: periodic updates encourage exploration, but resampling too frequently undermines SPI, which relies on strong continuity of the spectral directions across iterations.
Figure 5: OPT-13B fine-tuning efficiency (accuracy vs. wall-clock time) across four SuperGLUE tasks.
Figure 6: OPT-13B fine-tuning efficiency (relative time-to-same-accuracy) across four SuperGLUE tasks.
Figure 7: Effect of the spectral rank k for fine-tuning OPT-1.3B on SST-2; the two panels show (a) test accuracy and (b) training loss versus wall-clock time under different choices of k.
Figure 8: Effect of the subspace rank r for fine-tuning OPT-2.7B on SST-2; the two panels show test accuracy and training loss under different choices of r.
Figure 9: OPT-13B training loss curves across four SuperGLUE tasks.
Figure 10: OPT-13B evaluation loss curves across four SuperGLUE tasks.
Figure 11: LLaMA3-8B training loss curves across four SuperGLUE tasks.
Figure 12: LLaMA3-8B accuracy curves across four SuperGLUE tasks.
Figure 13: Gemma2-2B training loss curves across four SuperGLUE tasks.
Figure 14: Gemma2-2B accuracy curves across four SuperGLUE tasks.
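The coordinate change quoted under Figure 3 is easy to make concrete: when the lazily resampled subspace basis changes from A_old to A_new, the low-rank momentum is carried across by Eq. 11. A minimal sketch, assuming column-orthonormal bases of shape (m, r) and momentum stored as an r-by-n matrix; the shapes and the QR-built bases are illustrative assumptions.

```python
import numpy as np

def reproject_momentum(M_old, A_old, A_new):
    """Carry momentum across a subspace refresh (Eq. 11): M_new = (A_new)^T A_old M_old.

    A_old, A_new: (m, r) column-orthonormal subspace bases.
    M_old:        (r, n) momentum expressed in the old basis.
    """
    return A_new.T @ A_old @ M_old

rng = np.random.default_rng(3)
m, n, r = 64, 32, 8
A_old, _ = np.linalg.qr(rng.standard_normal((m, r)))
A_new, _ = np.linalg.qr(rng.standard_normal((m, r)))
M_old = rng.standard_normal((r, n))

M_new = reproject_momentum(M_old, A_old, A_new)
# The full-space momentum A_new @ M_new equals the orthogonal projection of the old
# full-space momentum A_old @ M_old onto the new subspace.
err = np.linalg.norm(A_new @ M_new - A_new @ A_new.T @ (A_old @ M_old))
print(M_new.shape, f"{err:.1e}")  # (8, 32) and an error at floating-point precision
```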
original abstract

Zeroth-order (ZO) optimization has become increasingly popular and important in fine-tuning large language models (LLMs), especially on edge devices due to its ability to adjust the model to local data without the need for memory-intensive back-propagation. Recent works try to reduce ZO variance through low-dimensional subspace search, but subspace restriction alone leaves key optimization geometry under-exploited, motivating additional acceleration. In this work, we focus on the hidden layer training problem in which spectral optimizers like Muon outperform AdamW due to its ability to exploit weak spectral directions by orthogonalization. However, we have discovered that unlike in the first-order setting, full orthogonalization works poorly in the ZO setting since the gradient estimates are highly noisy and unreliable. To address this issue, we propose a key approach we call partial orthogonalization. To do so, we replace the iconic Newton-Schulz procedure in Muon with the faster, more concentrated power-iteration method so that it only amplifies dominant spectral directions. Furthermore, to improve the efficiency and generalization of the algorithm, we adopted a streaming variant of power-iteration that requires low variance in gradients, which was achieved through constraining our search inside a subspace obtained through the projection of momentum, echoing recent advances. Experiments on LLM fine-tuning show that our method can achieve from 1.5x to 4x the convergence speed of ZO-Muon, the current SOTA algorithm, across SuperGlue datasets in the OPT-13B model. Across different models, we also reach competitive final accuracies with less time in most cases compared with strong ZO baselines such as MeZO, LOZO and ZO-Muon. Code is available at https://github.com/MOFA-LAB/ZO-MOPI.git.
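The abstract mentions a "streaming variant of power-iteration" without spelling it out on this page. A generic streaming (Oja-style) power-iteration update, which keeps a running orthonormal estimate of the dominant right-singular directions across a stream of noisy gradient estimates instead of recomputing them from scratch, might look like the sketch below; the step size, rank, and per-step QR re-orthonormalization are assumptions, not the paper's algorithm.

```python
import numpy as np

def streaming_power_iteration_step(Q, G, eta=0.5):
    """One generic streaming (Oja-style) power-iteration step.

    Q: (n, k) current orthonormal estimate of the dominant right-singular directions.
    G: (m, n) latest noisy gradient estimate in the stream.
    """
    Q = Q + eta * (G.T @ (G @ Q))   # power step against the newest sample's G^T G
    Q, _ = np.linalg.qr(Q)          # keep the running basis orthonormal
    return Q

# Toy stream: noisy rank-one gradients sharing one dominant right direction v.
rng = np.random.default_rng(4)
u, v = rng.standard_normal(64), rng.standard_normal(32)
Q = np.linalg.qr(rng.standard_normal((32, 4)))[0]
for _ in range(100):
    G = np.outer(u, v) + 0.5 * rng.standard_normal((64, 32))
    Q = streaming_power_iteration_step(Q, G)

# How much of the true dominant direction the learned subspace has captured (1.0 = fully).
print(np.linalg.norm(Q.T @ (v / np.linalg.norm(v))))
```

The point of the streaming form is that each new cheap, noisy estimate nudges an existing basis rather than paying for a fresh decomposition, which is why the abstract ties it to low-variance gradients obtained from the momentum subspace.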

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes partial orthogonalization for zeroth-order spectral optimization in LLM fine-tuning. It replaces the Newton-Schulz procedure from Muon with a power-iteration method to amplify only dominant spectral directions under noisy ZO gradients, combined with a streaming power-iteration variant and momentum-subspace projection for variance reduction. The central claim is that this yields 1.5x to 4x faster convergence than ZO-Muon on the OPT-13B model across SuperGLUE datasets, with competitive final accuracies versus the MeZO, LOZO, and ZO-Muon baselines. Code is provided at https://github.com/MOFA-LAB/ZO-MOPI.git.

Significance. If the reported speedups are reproducible and attributable to the partial-orthogonalization change, the work would offer a practical improvement to ZO methods for memory-efficient LLM adaptation by exploiting spectral geometry more robustly under noise. The open-sourced implementation is a clear strength supporting verification and extension.

major comments (3)
  1. The central empirical claim of 1.5x–4x acceleration over ZO-Muon rests on experiments whose setup is not detailed (no mention of run count, error bars, hyperparameter grids, or statistical significance tests). This directly affects assessment of whether the gains are reliable and generalizable.
  2. The motivation that full orthogonalization (Newton-Schulz) fails in the ZO regime because of unreliable gradient estimates is stated without supporting measurements, such as the variance of the estimated spectral directions or a side-by-side comparison of full versus partial orthogonalization under identical momentum-subspace constraints. (A toy sketch of such a variance measurement follows this list.)
  3. No ablation isolating the power-iteration replacement from the momentum-subspace projection is reported. For example, there is no experiment comparing ZO-Muon augmented only with the subspace projection but retaining full Newton-Schulz, so the specific contribution of partial orthogonalization to the observed speedups cannot be verified.
minor comments (2)
  1. The description of the streaming power-iteration update and the exact form of the momentum-subspace projection would benefit from explicit pseudocode or additional equations to aid re-implementation.
  2. Figure captions and axis labels in the experimental plots should explicitly state the number of independent runs and whether shaded regions represent standard deviation or standard error.
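Major comment 2 asks for direct evidence that raw ZO estimates carry unreliable spectral directions. As referenced in that comment, a toy version of such a measurement is sketched below: it extracts the leading right-singular direction from averaged two-point ZO estimates and checks its alignment with the direction of the exact gradient. The quadratic loss, matrix sizes, and averaging counts are arbitrary illustrative choices, not the paper's setup.

```python
import numpy as np

def zo_gradient_estimate(loss_fn, W, eps=1e-3, rng=None):
    """Two-point (SPSA-style) zeroth-order gradient estimate."""
    rng = np.random.default_rng() if rng is None else rng
    Z = rng.standard_normal(W.shape)
    return (loss_fn(W + eps * Z) - loss_fn(W - eps * Z)) / (2.0 * eps) * Z

def top_right_direction(G):
    """Leading right-singular vector of a gradient (estimate)."""
    return np.linalg.svd(G, full_matrices=False)[2][0]

rng = np.random.default_rng(5)
target = np.outer(rng.standard_normal(64), rng.standard_normal(32))   # rank-one optimum
loss = lambda W: 0.5 * np.sum((W - target) ** 2)
W = np.zeros_like(target)
true_v = top_right_direction(target)   # exact gradient at W = 0 is -target, same directions

for n_avg in (1, 16, 256):
    aligns = []
    for _ in range(50):
        G_bar = np.mean([zo_gradient_estimate(loss, W, rng=rng) for _ in range(n_avg)], axis=0)
        aligns.append(abs(top_right_direction(G_bar) @ true_v))
    # |cos| near 0 means the estimated spectral direction is mostly noise; it should
    # climb toward 1 only as many estimates are averaged (or accumulated via momentum).
    print(f"averaged estimates: {n_avg:4d}   mean |cos| = {np.mean(aligns):.2f}")
```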

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify key areas where additional details and evidence will strengthen the manuscript. We address each major comment below and will revise accordingly.

point-by-point responses
  1. Referee: The central empirical claim of 1.5x–4x acceleration over ZO-Muon rests on experiments whose setup is not detailed (no mention of run count, error bars, hyperparameter grids, or statistical significance tests). This directly affects assessment of whether the gains are reliable and generalizable.

    Authors: We agree that the experimental protocol requires more detail for reproducibility and to support the reliability of the claims. In the revised manuscript we will expand Section 4 to explicitly state the number of independent runs (3 random seeds), report standard deviation error bars on all convergence plots, describe the hyperparameter grids and tuning procedure applied to each baseline and our method, and include statistical significance tests (paired t-tests on final accuracies; a minimal example is sketched after this list) where relevant. The reported speedups reflect wall-clock time to target accuracy on SuperGLUE tasks with OPT-13B. revision: yes

  2. Referee: The motivation that full orthogonalization (Newton-Schulz) fails in the ZO regime because of unreliable gradient estimates is stated without supporting measurements, such as variance of the estimated spectral directions or side-by-side comparison of full versus partial orthogonalization under identical momentum-subspace constraints.

    Authors: We acknowledge that the manuscript presents the motivation without accompanying quantitative measurements. Our observation arose from early development runs showing unstable updates with full Newton-Schulz under ZO noise. We will add a new figure and accompanying text in the revision that reports the variance of estimated dominant singular vectors for full Newton-Schulz versus partial power iteration, both with and without the momentum-subspace projection, thereby providing direct evidence for the robustness advantage of partial orthogonalization. revision: yes

  3. Referee: No ablation isolating the power-iteration replacement from the momentum-subspace projection is reported. For example, there is no experiment comparing ZO-Muon augmented only with the subspace projection but retaining full Newton-Schulz, so the specific contribution of partial orthogonalization to the observed speedups cannot be verified.

    Authors: We agree that isolating the contribution of the partial-orthogonalization change is important. The current experiments compare the complete proposed algorithm against ZO-Muon. We will add an ablation study in the revised experimental section that includes ZO-Muon augmented solely with the momentum-subspace projection (retaining Newton-Schulz) alongside the full ZO-MOPI method. This will allow direct verification of the incremental benefit from replacing Newton-Schulz with power iteration. revision: yes
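Response 1 commits to paired t-tests on final accuracies across random seeds; as referenced there, a minimal version of such a test with scipy's `ttest_rel` and hypothetical per-seed accuracies (placeholders, not numbers from the paper) looks like this:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-seed final accuracies (%) on one task; replace with measured values.
proposed = np.array([91.3, 90.9, 91.6])   # e.g. the proposed method, seeds 0-2
baseline = np.array([90.7, 90.4, 91.0])   # e.g. ZO-Muon, the same seeds

t_stat, p_value = ttest_rel(proposed, baseline)   # paired: each seed compared with itself
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

With only three seeds the test has little power, so reporting means with standard deviations alongside any p-values would be prudent.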

Circularity Check

0 steps flagged

No circularity in derivation; algorithmic proposal validated empirically

full rationale

The paper proposes a practical algorithmic change, replacing Newton-Schulz full orthogonalization in Muon with a streaming power-iteration partial orthogonalization combined with momentum-subspace projection, to address observed noise in zeroth-order gradient estimates. No equations are presented that reduce the method or its claimed speedups to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on empirical results across SuperGLUE tasks and OPT-13B, which are independent of the derivation itself. The approach echoes prior subspace ideas but does not smuggle in ansatzes or uniqueness theorems from the authors' own prior work as unverified axioms. This is a standard, self-contained algorithmic contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach is primarily algorithmic and empirical with no explicit free parameters, axioms, or invented entities stated in the abstract; it builds on existing ZO and spectral optimization concepts.

pith-pipeline@v0.9.0 · 5623 in / 955 out tokens · 45802 ms · 2026-05-12T02:14:00.026278+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 5 internal anchors

  1. [1] Yiming Chen, Yuan Zhang, Liyuan Cao, Kun Yuan, and Zaiwen Wen. Enhancing zeroth-order fine-tuning for language models with low-rank structures. arXiv preprint arXiv:2410.07698.

  2. [2] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

  3. [3] Tanmay Gautam, Youngsuk Park, Hao Zhou, Parameswaran Raman, and Wooseok Ha. Variance-reduced zeroth-order methods for fine-tuning language models. arXiv preprint arXiv:2404.08080.

  4. [4] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

  5. [5] Zhendong Huang, Hengjie Cao, Fang Dong, Ruijun Huang, Mengyi Chen, Yifeng Yang, Xin Zhang, Anrui Chen, Mingzhi Dong, Yujiang Wang, et al. Spectra: Rethinking optimizers for LLMs under spectral anisotropy. arXiv preprint arXiv:2602.11185.

  6. [6] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon.

  7. [7] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  8. [8] Yicheng Lang, Changsheng Wang, Yihua Zhang, Mingyi Hong, Zheng Zhang, Wotao Yin, and Sijia Liu. Powering up zeroth-order training via subspace gradient orthogonalization. arXiv preprint arXiv:2602.17155.

  9. [9] Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, and Yang You. Sparse MeZO: Less parameters for better performance in zeroth-order LLM fine-tuning. arXiv preprint arXiv:2402.15751.

  10. [10] Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Nikolay Bushkov, Stanislav Moiseev, and Aleksandr Beznosikov. Leveraging coordinate momentum in SignSGD and Muon: Memory-optimized zero-order. arXiv preprint arXiv:2506.04430.

  11. [11] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273.

  12. [12] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

  13. [13] Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.

  14. [14] Yao Shu, Qixin Zhang, Kun He, and Zhongxiang Dai. Refining adaptive zeroth-order optimization at ease. arXiv preprint arXiv:2502.01014.

  15. [15] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

  16. [16] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118.

  17. [17] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing Shampoo using Adam. arXiv preprint arXiv:2409.11321, 2024.

  18. [18] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

  19. [19] Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D Lee, Wotao Yin, Mingyi Hong, et al. Revisiting zeroth-order optimization for memory-efficient LLM fine-tuning: A benchmark. arXiv preprint arXiv:2402.11592.

  20. [20] Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, and Ivor W Tsang. Second-order fine-tuning without pain for LLMs: A Hessian informed zeroth-order optimizer. arXiv preprint arXiv:2402.15173.

  21. [21] Internal anchor (Appendix A, Analysis of Subspace Power Iteration): for computational efficiency the streaming power iteration (SPI) is run in the subspace rather than the full space, and the appendix argues that the projection-based SPI is equivalent to full-space SPI. Writing the SVD of $M \in \mathbb{R}^{m \times n}$ as $M = U\,\mathrm{diag}(\sigma)\,V^{\top}$ (Eq. 13), $U_{[:,1:k]}$ and $V_{[:,1:k]}$ denote the top-k left and right singular vectors.

  22. [22] Internal anchor (results): compared with ZO baselines when fine-tuning OPT-13B on SST-2, the method reaches the highest accuracy of 91.7% with almost the same wall-clock time and slightly additional memory usage; the snippet concludes that the method achieves better accuracy without requiring extra memory or computation overhead.

  23. [23] Internal anchor (hyperparameter appendix): notes that the total query budget is matched with the other baselines and references LoRA settings (Malladi et al., 2023); the adjoining figure shows (a) accuracy and (b) training loss versus wall-clock time for k = 8, 16, 32, 64.