pith. machine review for the scientific record.

arxiv: 2605.11316 · v1 · submitted 2026-05-11 · 💻 cs.LG · math.OC

Recognition: 2 Lean theorem links

Error whitening: Why Gauss-Newton outperforms Newton

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:04 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords error whitening · Gauss-Newton · function space projection · tangent space · optimization dynamics · parameterization effects · Hessian approximation

The pith

Gauss-Newton descent whitens prediction errors by projecting onto the model's tangent space and replacing JJ^T with the identity, unlike Newton's method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adopts a function-space view to explain why Gauss-Newton methods often beat Newton's method despite being viewed as mere Hessian approximations. It shows that the generalized Gauss-Newton matrix projects the optimization direction onto the tangent space of the model predictions. This projection cancels the parameterization-dependent matrix JJ^T and replaces it with the identity. As a result the mismatch between predictions and targets evolves according to the loss structure alone rather than being distorted by how the model is written in parameters. The authors call the cancellation error whitening and demonstrate that it produces the observed performance advantage across several learning settings.

Core claim

The generalized Gauss-Newton matrix projects the Newton direction in function space onto the model's tangent space, while a Jacobian-only variant projects the function space loss gradient onto the same tangent space. Both projections eliminate distortions from the model's parameterization by replacing JJ^T with the identity. This effect is called error whitening. Once the parameterization is removed, the prediction-target mismatch evolves according to dynamics dictated by the structure of the loss and the projection produced by the optimizer. Error whitening is a special property of Gauss-Newton descent that rigorously distinguishes it from Newton's method.

What carries the argument

The function-space projection performed by the generalized Gauss-Newton matrix onto the model's tangent space, which replaces the parameterization matrix JJ^T with the identity.
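A minimal numerical sketch of this cancellation (illustrative numpy, not code from the paper; the matrices `J`, `A`, `H` are generic stand-ins): for a full-row-rank Jacobian J and a positive-definite loss Hessian H in function space, the GGN map J (J^T H J)^† J^T collapses to H^-1, while JJ^T itself is nothing like the identity.

```python
import numpy as np

rng = np.random.default_rng(42)

m, n = 4, 30                         # outputs and parameters, overparameterized (n > m)
J = rng.normal(size=(m, n))          # stand-in model Jacobian, generically full row rank
A = rng.normal(size=(m, m))
H = A @ A.T + m * np.eye(m)          # positive-definite loss Hessian in function space

# The parameterization-dependent factor that distorts gradient-descent dynamics:
JJt = J @ J.T

# The GGN map in function space: J (J^T H J)^† J^T. For full-row-rank J it
# collapses to H^{-1}, i.e. JJ^T has effectively been replaced by the identity.
ggn_map = J @ np.linalg.pinv(J.T @ H @ J) @ J.T

print(np.allclose(ggn_map, np.linalg.inv(H)))   # True: the cancellation holds
print(np.allclose(JJt, np.eye(m)))              # False: JJ^T is not the identity
```

The cancellation uses only the pseudoinverse identity (J^T H J)^† = J^† H^-1 (J^T)^† for full-row-rank J, which is why the review calls it self-contained linear algebra.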

If this is right

  • The mismatch between model predictions and targets evolves independently of the model's parameterization once the projection is applied.
  • Gauss-Newton optimizers follow the theoretically predicted function-space dynamics in practice.
  • Gauss-Newton descent outperforms Newton's method as well as Adam and Muon on supervised learning, physics-informed deep learning, and approximate dynamic programming tasks.
  • After whitening, optimization dynamics are governed only by the chosen loss and the specific projection induced by the optimizer.
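Under the full-row-rank assumption on J (so that J J^† = I), the contrast these bullets describe can be sketched in continuous time; this is an editorial reconstruction from the abstract's definitions, not the paper's own derivation:

```latex
% r = p - y is the prediction-target mismatch, \ell the loss in function space,
% and the parameter flow \dot\theta = -\eta\, G^{\dagger} \nabla_\theta L
% (with \nabla_\theta L = J^\top \nabla_p \ell) induces \dot r = J \dot\theta.
\begin{align*}
\text{gradient descent } (G = I):\qquad
  \dot r &= -\eta\, J J^{\top} \nabla_{p}\ell
  && \text{(distorted by the parameterization)}\\
\text{Jacobian-only GN } (G = J^{\top} J):\qquad
  \dot r &= -\eta\, J (J^{\top} J)^{\dagger} J^{\top} \nabla_{p}\ell
         = -\eta\, \nabla_{p}\ell
  && \text{(whitened)}\\
\text{GGN } (G = J^{\top} H_{\ell} J):\qquad
  \dot r &= -\eta\, J (J^{\top} H_{\ell} J)^{\dagger} J^{\top} \nabla_{p}\ell
         = -\eta\, H_{\ell}^{-1} \nabla_{p}\ell
  && \text{(Newton direction in function space)}
\end{align*}
```

Newton's method in parameter space adds the curvature term $\sum_i (\nabla_p \ell)_i \nabla^2_\theta p_i$ to $G$, which blocks the cancellation; that is exactly the distinction the claim rests on.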

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection idea could be used to construct new first-order methods that inherit the whitening property without computing second derivatives.
  • In overparameterized regimes where the tangent space is high-dimensional, the whitening effect may become even more pronounced and explain the practical success of Gauss-Newton-style updates.
  • Design choices that preserve or destroy the tangent-space projection could be used to predict optimizer behavior on new loss functions before running large-scale experiments.

Load-bearing premise

The function-space projection analysis accurately captures the dominant dynamics of optimization without higher-order or discretization effects altering the JJ^T cancellation.

What would settle it

A controlled experiment in which the measured evolution of the prediction-target mismatch under Gauss-Newton deviates from the trajectory predicted by setting JJ^T to the identity in a simple non-least-squares loss.
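A toy version of such a test, as an illustrative sketch: the quadratic model, matrices `J0`, `Q`, `R`, and step size `eta` below are hypothetical stand-ins for the paper's networks, and log-cosh supplies the non-least-squares loss. One small Jacobian-only Gauss-Newton step is compared against the idealized whitened step -η∇_pℓ; a persistent deviation here is what the experiment would look for.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 5, 60                          # outputs and parameters (n > m)
J0 = rng.normal(size=(m, n))          # linear part of a toy model (hypothetical)
Q = rng.normal(size=(m, n))
R = rng.normal(size=(m, n))
y = rng.normal(size=m)                # targets

def predict(theta):
    # Mildly nonlinear overparameterized model; not one of the paper's networks.
    return J0 @ theta + 0.1 * (Q @ theta) * (R @ theta)

def jacobian(theta):
    # Analytic Jacobian of predict at theta, shape (m, n).
    return J0 + 0.1 * ((R @ theta)[:, None] * Q + (Q @ theta)[:, None] * R)

theta = 0.1 * rng.normal(size=n)
p = predict(theta)
g = np.tanh(p - y)                    # gradient of the (non-least-squares) log-cosh loss

eta = 1e-4
# Jacobian-only Gauss-Newton step: (J^T J)^† J^T g = J^† g.
dtheta = -eta * (np.linalg.pinv(jacobian(theta)) @ g)

dp_measured = predict(theta + dtheta) - p   # actual change in predictions
dp_whitened = -eta * g                      # change predicted if JJ^T were the identity

rel_err = np.linalg.norm(dp_measured - dp_whitened) / np.linalg.norm(dp_whitened)
print(rel_err)   # small at this step size; systematic growth is what would settle it
```

At larger step sizes or stronger curvature the second-order term in `predict` re-enters, which is precisely the discretization caveat in the load-bearing premise above.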

Figures

Figures reproduced from arXiv: 2605.11316 by Brian Wetton, Maricela Best McKay, Nathan P. Lawrence, R. Bhushan Gopaluni.

Figure 1. Evolution of the residual for a loss with …
Figure 2. Function space update directions for each optimizer, visualized as heat maps over the input domain, …
Figure 3. Training loss, evaluation MSE, and cosine similarity to the mismatch and function-space loss gradient, …
Figure 4. The reference Allen-Cahn solution, the PINN approximation achieved when training with …
Figure 5. The true optimal value function compared against approximations obtained by each optimizer used in …
Figure 6. Function space update directions for GJ, GGN, Newton’s method, and Muon compared against the mismatch and the function space gradient of the loss. Each direction is computed using parameters along a Muon optimization path at different loss values. The panels show a heat map over the domain of the mismatch, the function space gradient of the loss, and the direction each optimizer’s update points in functio…
Figure 7. Function space update directions for GJ, GGN, Newton’s method, and Muon compared against the mismatch. Each direction is computed using parameters along a Muon optimization path, for loss values near 1E-01 and 1E-04. The neural network is being trained using mean log-cosh loss to approximate the function g(x, y) = sin(2πx) sin(2πy) + sin(7πx) sin(7πy) on the unit square. The panels show a heat map over th…
Figure 8. Cosine similarity between each optimizer’s function space update direction and the mismatch …
Figure 9. Convergence plots on a log-log scale showing training loss (left) and evaluation MSE (right) for …
Figure 10. Function space update directions for GJ, G, H, and Muon, shown alongside the idealized function space update directions for GJ and G: the residual p − y and H_ℓ†(p − y), respectively. Directions are computed from parameters along a Muon optimization path on MNIST with cross-entropy loss, at accuracies of 75.33% (top) and 84.67% (bottom). Columns correspond to output classes [0, 9] and rows are samples …
Figure 11. Cosine similarity between the function space directions across all optimizers and the mismatch as a …
Figure 12. Convergence plots showing training loss (left) and accuracy on a held back test data set (right) for …
Figure 13. Companion figure to Fig …
Figure 14. Performance as a function of dataset size. Solid line is the median and shaded regions are the min-max …
read the original abstract

The Gauss-Newton matrix is widely viewed as a positive semidefinite approximation of the Hessian, yet mounting empirical evidence shows that Gauss-Newton descent outperforms Newton's method. We adopt a function space perspective to analyze this phenomenon. We show that the generalized Gauss-Newton (GGN) matrix projects the Newton direction in function space onto the model's tangent space, while a Jacobian-only variant obtained by applying the least squares Gauss-Newton matrix to non-least squares losses projects the function space loss gradient onto this same tangent space. Both projections eliminate distortions from the model's parameterization. Specifically, the evolution of the prediction-target mismatch depends on the model's parameterization through the matrix $JJ^\top$ where $J$ is the Jacobian of the model with respect to its parameters. The projections effectively replace $JJ^\top$ with the identity. We call this effect error whitening. Once the parameterization is removed, the prediction-target mismatch evolves according to dynamics dictated by the structure of the loss and the projection produced by the optimizer. Error whitening is a special property of Gauss-Newton descent that rigorously distinguishes it from Newton's method. We empirically demonstrate that Gauss-Newton optimizers follow the theoretically predicted function space dynamics and outperform Newton's method, Adam, and Muon across case studies spanning supervised learning, physics-informed deep learning, and approximate dynamic programming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Gauss-Newton (GN) descent outperforms Newton's method because the generalized Gauss-Newton matrix projects the Newton direction (or loss gradient for a Jacobian-only variant) onto the model's tangent space in function space. This 'error whitening' effect replaces the parameterization-dependent JJ^T with the identity in the evolution of the prediction-target mismatch, so that dynamics depend only on loss structure and the projection. The analysis is supported by derivations and empirical trajectory matching showing GN superiority over Newton, Adam, and Muon in supervised learning, physics-informed networks, and approximate dynamic programming.

Significance. If the function-space projection view holds, the work supplies a principled distinction between GN and Newton that goes beyond the usual positive-semidefinite approximation narrative, potentially informing second-order optimizer design in deep learning. The empirical demonstrations of predicted dynamics are a concrete strength.

major comments (2)
  1. The central claim that error whitening 'rigorously distinguishes' GN from Newton rests on the projection analysis capturing dominant dynamics. The manuscript should explicitly bound or test the effect of higher-order terms in the model expansion and finite-step discretization on the JJ^T cancellation, especially in the finite-width non-convex regimes of the case studies (as the weakest assumption in the provided analysis). Without this, the distinction may not transfer as stated.
  2. Handling of non-least-squares losses via the Jacobian-only variant: the abstract sketches the projection but the manuscript lacks visible details on proof completeness for this case; a self-contained derivation or counter-example would be needed to support the general claim.
minor comments (2)
  1. Notation: the early definition and consistent use of J (Jacobian) and JJ^T should be clarified with a small example to aid readers new to the function-space perspective.
  2. Empirical sections: adding error bars or multiple random seeds to the trajectory-matching plots would improve readability and statistical clarity without altering the core results.
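The kind of small example the first minor comment asks for might look like this (illustrative numpy, not taken from the manuscript; `A` and the invertible reparameterization `B` are generic stand-ins): reparameterizing changes JJ^T but leaves the tangent-space projection, and hence the whitened dynamics, untouched.

```python
import numpy as np

rng = np.random.default_rng(1)

m, n = 3, 8
A = rng.normal(size=(m, n))     # Jacobian under the original parameters theta
B = rng.normal(size=(n, n))     # generic invertible reparameterization theta = B u
J1, J2 = A, A @ B               # Jacobians before and after reparameterizing

# JJ^T depends on how the model is written in parameters:
same_JJt = np.allclose(J1 @ J1.T, J2 @ J2.T)

# The tangent-space projection J (J^T J)^† J^T does not; for full row rank it is I_m:
P1 = J1 @ np.linalg.pinv(J1.T @ J1) @ J1.T
P2 = J2 @ np.linalg.pinv(J2.T @ J2) @ J2.T

print(same_JJt)                                       # False: the parameterization leaks in
print(np.allclose(P1, np.eye(m)), np.allclose(P2, np.eye(m)))   # True True
```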

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of the analysis that we have addressed through revisions to improve clarity and rigor.

read point-by-point responses
  1. Referee: The central claim that error whitening 'rigorously distinguishes' GN from Newton rests on the projection analysis capturing dominant dynamics. The manuscript should explicitly bound or test the effect of higher-order terms in the model expansion and finite-step discretization on the JJ^T cancellation, especially in the finite-width non-convex regimes of the case studies (as the weakest assumption in the provided analysis). Without this, the distinction may not transfer as stated.

    Authors: We agree that the projection analysis relies on a first-order model expansion and that higher-order terms, along with finite discretization effects, could in principle affect the exact cancellation of JJ^T. Our empirical trajectory-matching experiments already demonstrate close agreement with the predicted dynamics in the finite-width non-convex settings of the case studies, suggesting the leading-order effect remains dominant. In the revision we have added a dedicated subsection that derives the leading remainder term in the Taylor expansion and reports additional numerical tests that vary step size and measure the resulting deviation from the idealized linear dynamics. While a fully general, tight bound for arbitrary non-convex finite-width networks lies beyond the scope of the present work, these additions make the domain of validity of the distinction explicit. revision: partial

  2. Referee: Handling of non-least-squares losses via the Jacobian-only variant: the abstract sketches the projection but the manuscript lacks visible details on proof completeness for this case; a self-contained derivation or counter-example would be needed to support the general claim.

    Authors: We thank the referee for noting the need for greater detail on the Jacobian-only variant. The algebraic steps showing that the least-squares GGN applied to a general loss gradient yields the tangent-space projection (and thereby replaces JJ^T with the identity) were present but not fully expanded. The revised manuscript now contains a self-contained derivation in the main text that walks through each matrix identity and the resulting evolution equation for the prediction-target mismatch. Because the derivation is algebraic and holds under the stated assumptions on the loss and Jacobian, a counter-example is not required; we have added a short remark clarifying the assumptions and their implications for non-least-squares objectives. revision: yes
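A compact version of the algebra being referenced, under the stated full-row-rank assumption (an editorial sketch, not the authors' text):

```latex
% Least-squares GN matrix G = J^\top J applied to a general loss \ell(p),
% with J \in \mathbb{R}^{m \times n} of full row rank m, so J J^{\dagger} = I_m.
\begin{align*}
\Delta\theta &= -\eta\,(J^{\top} J)^{\dagger} J^{\top} \nabla_{p}\ell
              = -\eta\, J^{\dagger} \nabla_{p}\ell
  && \text{since } (J^{\top} J)^{\dagger} J^{\top} = J^{\dagger},\\
\Delta p &\approx J\,\Delta\theta
          = -\eta\, J J^{\dagger} \nabla_{p}\ell
          = -\eta\, \nabla_{p}\ell
  && \text{to first order in the step size.}
\end{align*}
```

The mismatch thus follows the raw function-space gradient with no $JJ^\top$ factor; the first-order approximation in the second line is exactly where the higher-order effects raised in major comment 1 can re-enter.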

Circularity Check

0 steps flagged

No circularity: function-space projection analysis is self-contained linear algebra

full rationale

The derivation begins from the standard definitions of the Newton and Gauss-Newton updates, applies the function-space gradient and Jacobian projection operators, and shows algebraically that the GGN step replaces JJ^T by the identity in the evolution equation for the prediction-target mismatch. This replacement follows directly from the matrix forms of the two optimizers and does not rely on any fitted parameter, self-citation chain, or ansatz imported from prior work by the same authors. The subsequent claim that this constitutes 'error whitening' is a naming of the derived identity rather than a redefinition that forces the result. Empirical sections compare observed trajectories to the predicted dynamics but do not feed fitted quantities back into the theoretical statements. The analysis therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The analysis rests on standard linear algebra (orthogonal projections onto range of J) and the definition of function-space gradients; no free parameters are introduced, and the only invented entity is the named phenomenon itself.

axioms (2)
  • domain assumption The local behavior of the model is captured by its Jacobian J, and updates act in the tangent space spanned by its columns.
    Invoked when defining the projection that cancels JJ^T.
  • domain assumption The evolution of the prediction-target mismatch is governed by the composition of the loss gradient with the model Jacobian.
    Central to deriving the dynamics before and after projection.
invented entities (1)
  • error whitening no independent evidence
    purpose: Name for the effect in which the projection replaces JJ^T by the identity matrix.
    New descriptive term coined in the paper; no independent empirical handle supplied beyond the theoretical derivation.

pith-pipeline@v0.9.0 · 5535 in / 1467 out tokens · 62497 ms · 2026-05-13T02:04:42.171700+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 2 internal anchors

  1. [1]

    The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

    Natalie Abreu, Nikhil Vyas, Sham Kakade, and Depen Morwani. The potential of second-order optimization for llms: A study with full gauss-newton.arXiv preprint arXiv:2510.09378, 2025

  2. [2]

    arXiv preprint arXiv:2002.09018 , year=

    Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Scalable second order optimization for deep learning.arXiv preprint arXiv:2002.09018, 2021

  3. [3]

    Bellemare, Will Dabney, and Mark Rowland.Distributional Reinforcement Learning

    Marc G. Bellemare, Will Dabney, and Mark Rowland.Distributional Reinforcement Learning. Adaptive Computation and Machine Learning. The MIT Press, Cambridge London, 2023

  4. [4]

    Athena scientific, 2012

    Dimitri Bertsekas.Dynamic Programming and Optimal Control: Volume I, volume 4. Athena scientific, 2012

  5. [5]

    Athena Scientific, 2022

    Dimitri Bertsekas.Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control. Athena Scientific, 2022

  6. [6]

    Athena Scientific, 1996

    Dimitri Bertsekas and John N Tsitsiklis.Neuro-Dynamic Programming. Athena Scientific, 1996

  7. [7]

    Near-optimal sketchy natural gradients for physics-informed neural networks

    Maricela Best Mckay, Avleen Kaur, Chen Greif, and Brian Wetton. Near-optimal sketchy natural gradients for physics-informed neural networks. InForty-Second International Conference on Machine Learning, 2025

  8. [8]

    SIAM, 2024

    Åke Björck.Numerical methods for least squares problems. SIAM, 2024

  9. [9]

    JAX: composable transformations of Python+NumPy programs, 2018

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Yash Katariya, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman- Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018

  10. [10]

    Hessian Matrix vs

    Pei Chen. Hessian Matrix vs. Gauss–Newton Hessian matrix.SIAM Journal on Numerical Analysis, 49(4):1417–1435, 2011

  11. [11]

    Identifying and attacking the saddle point problem in high-dimensional non- convex optimization.Advances in neural information processing systems, 27, 2014

    Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non- convex optimization.Advances in neural information processing systems, 27, 2014

  12. [12]

    Mitigating propagation failures in physics-informed neural networks using retain-resample-release (r3) sampling

    Arka Daw, Jie Bu, Sifan Wang, Paris Perdikaris, and Anuj Karpatne. Mitigating propagation failures in physics-informed neural networks using retain-resample-release (r3) sampling. In Proceedings of the 40th International Conference on Machine Learning, pages 7264–7302, 2023

  13. [13]

    A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks.Journal of Computational Physics, 435:110242, 2021

    Suchuan Dong and Naxian Ni. A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks.Journal of Computational Physics, 435:110242, 2021

  14. [14]

    Differentiable predictive control: Deep learning alternative to explicit model predictive control for unknown nonlinear systems.Journal of Process Control, 116:80–92, 2022

    Ján Drgo ˇna, Karol Kiš, Aaron Tuor, Draguna Vrabie, and Martin Klau ˇco. Differentiable predictive control: Deep learning alternative to explicit model predictive control for unknown nonlinear systems.Journal of Process Control, 116:80–92, 2022. 10

  15. [15]

    NeuroMANCER: Neural Modules with Adaptive Nonlinear Constraints and Efficient Regularizations

    Jan Drgona, Aaron Tuor, James Koch, Madelyn Shapiro, Bruno Jacob, and Draguna Vra- bie. NeuroMANCER: Neural Modules with Adaptive Nonlinear Constraints and Efficient Regularizations. 2023

  16. [16]

    John Wiley & Sons, 2013

    Roger Fletcher.Practical methods of optimization. John Wiley & Sons, 2013

  17. [17]

    A stable whitening optimizer for efficient neural network training

    Kevin Frans, Sergey Levine, and Pieter Abbeel. A stable whitening optimizer for efficient neural network training. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  18. [18]

    Addressing function approximation error in actor-critic methods

    Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1587–1596. PMLR, 2018

  19. [19]

    Carnegie Mellon University, 1999

    Geoffrey J Gordon.Approximate Solutions to Markov Decision Processes. Carnegie Mellon University, 1999

  20. [20]

    Deep learning alternatives of the kolmogorov superposition theorem.arXiv preprint arXiv:2410.01990, 2024

    Leonardo Ferreira Guilhoto and Paris Perdikaris. Deep learning alternatives of the kolmogorov superposition theorem.arXiv preprint arXiv:2410.01990, 2024

  21. [21]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1842–1850. PMLR, 2018

  22. [22]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1861–1870. PMLR, 2018

  23. [23]

    Double Q-learning.Advances in neural information processing systems, 23, 2010

    Hado Hasselt. Double Q-learning.Advances in neural information processing systems, 23, 2010

  24. [24]

    Rainbow: Combining Improvements in Deep Reinforcement Learning, 2017

    Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dab- ney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining Improvements in Deep Reinforcement Learning, 2017

  25. [25]

    The 37 implementation details of proximal policy optimization

    Shengyi Huang, Rousslan Fernand Julien Dossa, Antonin Raffin, Anssi Kanervisto, and Weixun Wang. The 37 implementation details of proximal policy optimization. InICLR Blog Track, 2022

  26. [26]

    Gauss-newton natural gradient descent for physics-informed computational fluid dynamics.Computers & Fluids, page 106955, 2025

    Anas Jnini, Flavio Vella, and Marius Zeinhofer. Gauss-newton natural gradient descent for physics-informed computational fluid dynamics.Computers & Fluids, page 106955, 2025

  27. [27]

    Pathological spectra of the fisher informa- tion metric and its variants in deep neural networks.Neural Computation, 33(8):2274–2307, 2021

    Ryo Karakida, Shotaro Akaho, and Shun-ichi Amari. Pathological spectra of the fisher informa- tion metric and its variants in deep neural networks.Neural Computation, 33(8):2274–2307, 2021

  28. [28]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  29. [29]

    Characterizing possible failure modes in physics-informed neural networks.Advances in neural information processing systems, 34:26548–26560, 2021

    Aditi Krishnapriyan, Amir Gholami, Shandian Zhe, Robert Kirby, and Michael W Mahoney. Characterizing possible failure modes in physics-informed neural networks.Advances in neural information processing systems, 34:26548–26560, 2021

  30. [30]

    Scalable optimization in the modular norm.Advances in Neural Information Processing Systems, 37:73501–73548, 2024

    Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng, Phillip Isola, and Jeremy Bernstein. Scalable optimization in the modular norm.Advances in Neural Information Processing Systems, 37:73501–73548, 2024

  31. [31]

    Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998

  32. [32]

    Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 2002

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 2002. 11

  33. [33]

    Lillicrap, Jonathan J

    Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. 2015

  34. [34]

    Carnegie Mellon University, 1992

    Long-Ji Lin.Reinforcement Learning for Robots Using Neural Networks. Carnegie Mellon University, 1992

  35. [35]

    On the limited memory BFGS method for large scale optimiza- tion.Mathematical programming, 45(1):503–528, 1989

    Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimiza- tion.Mathematical programming, 45(1):503–528, 1989

  36. [36]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019

  37. [37]

    Understanding SOAP from the perspective of gradient whitening.arXiv preprint arXiv:2509.22938, 2025

    Yanqing Lu, Letao Wang, and Jinbo Liu. Understanding SOAP from the perspective of gradient whitening.arXiv preprint arXiv:2509.22938, 2025

  38. [38]

    New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020

    James Martens. New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020

  39. [39]

    Optimizing neural networks with kronecker-factored approx- imate curvature

    James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approx- imate curvature. InInternational conference on machine learning, pages 2408–2417. PMLR, 2015

  40. [40]

    Asynchronous methods for deep reinforcement learning.arXiv preprint arXiv:1602.01783,

    V olodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lilli- crap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning.arXiv:1602.01783 [cs], 2016

  41. [41]

    Playing Atari with Deep Reinforcement Learning, 2013

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning, 2013

  42. [42]

    A new perspective on shampoo’s preconditioner

    Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, and Lucas Janson. A new perspective on shampoo’s preconditioner. InInternational Conference on Learning Representations, 2025

  43. [43]

    Achieving high accuracy with PINNs via energy natural gradient descent

    Johannes Müller and Marius Zeinhofer. Achieving high accuracy with PINNs via energy natural gradient descent. InInternational Conference on Machine Learning, pages 25471–25485. PMLR, 2023

  44. [44]

    Springer, 2006

    Jorge Nocedal and Stephen J Wright.Numerical optimization. Springer, 2006

  45. [45]

    PyTorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

  46. [46]

    Powell.Approximate Dynamic Programming: Solving the Curses of Dimensionality

    Warren B. Powell.Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley Series in Probability and Statistics. Wiley, Hoboken, N.J, 2nd ed edition, 2011

  47. [47]

    Raissi, P

    M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.Journal of Computational Physics, 378:686–707, February 2019

  48. [48]

    Neural Fitted Q Iteration – First Experiences with a Data Efficient Neural Reinforcement Learning Method

    Martin Riedmiller. Neural Fitted Q Iteration – First Experiences with a Data Efficient Neural Reinforcement Learning Method. InMachine Learning: ECML 2005, volume 3720, pages 317–328. Springer Berlin Heidelberg, Berlin, Heidelberg, 2005

  49. [49]

    Rohrhofer, Stefan Posch, Clemens Gößnitzer, and Bernhard C Geiger

    Franz M. Rohrhofer, Stefan Posch, Clemens Gößnitzer, and Bernhard C Geiger. On the role of fixed points of dynamical systems in training physics-informed neural networks.Transactions on Machine Learning Research, 2022

  50. [50]

    Prioritized Experience Replay

    Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015. 12

  51. [51]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InProceedings of the 32nd International Conference on Machine Learning, volume 37 ofProceedings of Machine Learning Research, pages 1889–1897, Lille, France, 2015. PMLR

  52. [52]

    Proximal policy optimization algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. 2017

  53. [53] Weijie Su, Stephen Boyd, and Emmanuel J. Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1–43, 2016.

  54. [54] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

  55. [55] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, Massachusetts, second edition, 2018.

  56. [56] Russ Tedrake. Underactuated Robotics. 2023.

  57. [57] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing shampoo using adam. In International Conference on Learning Representations, 2025.

  58. [58] Sifan Wang, Ananyae Kumar Bhartari, Bowen Li, and Paris Perdikaris. Gradient alignment in physics-informed neural networks: A second-order optimization perspective. arXiv preprint arXiv:2502.00604, 2025.

  59. [59] Sifan Wang, Bowen Li, Yuhan Chen, and Paris Perdikaris. PirateNets: Physics-informed deep learning with residual adaptive networks. Journal of Machine Learning Research, 25(402):1–51, 2024.

  60. [60] Sifan Wang, Shyam Sankaran, and Paris Perdikaris. Respecting causality is all you need for training physics-informed neural networks. arXiv preprint arXiv:2203.07404, 2022.

  61. [61] Sifan Wang, Shyam Sankaran, Hanwen Wang, and Paris Perdikaris. An expert's guide to training physics-informed neural networks. arXiv preprint arXiv:2308.08468, 2023.

  62. [62] Sifan Wang, Yujun Teng, and Paris Perdikaris. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM Journal on Scientific Computing, 43(5):A3055–A3081, January 2021.

  63. [63] Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why PINNs fail to train: A neural tangent kernel perspective. Journal of Computational Physics, 449:110768, January 2022.

  64. [64] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995–2003. PMLR, 2016.

  65. [65] Jian Cheng Wong, Chin Chun Ooi, Abhishek Gupta, and Yew-Soon Ong. Learning in sinusoidal spaces with physics-informed neural networks. IEEE Transactions on Artificial Intelligence, 5(3):985–1000, 2022.

  66. [66] Zhirong Yang and Jorma Laaksonen. Principal whitened gradient for information geometry. Neural Networks, 21(2-3):232–240, 2008.

A Identities and useful calculations

Lemma A.1. For any matrix $M \in \mathbb{R}^{m \times n}$ with rank $r$, $(M^\top M)^\dagger M^\top = M^\dagger$.

This is a standard, well-known result, included here simply for completeness.

Proof. Consider the singular value decomposition $M = U \Sigma V^\top$, with nonzero singular values $\sigma_1 \geq \dots \geq \sigma_r > 0$.

Then $M^\top M = V \Sigma^\top \Sigma V^\top$, where $\Sigma^\top \Sigma \in \mathbb{R}^{n \times n}$ has the $r \times r$ nonzero block $\operatorname{diag}(\sigma_1^2, \dots, \sigma_r^2)$ and zeros elsewhere. Hence
$$(M^\top M)^\dagger M^\top = V (\Sigma^\top \Sigma)^\dagger \Sigma^\top U^\top = V \Sigma^\dagger U^\top = M^\dagger,$$
since $(\Sigma^\top \Sigma)^\dagger \Sigma^\top$ has the $r \times r$ nonzero block $\operatorname{diag}(\sigma_1^{-2}, \dots, \sigma_r^{-2}) \operatorname{diag}(\sigma_1, \dots, \sigma_r) = \operatorname{diag}(1/\sigma_1, \dots, 1/\sigma_r)$, which is exactly $\Sigma^\dagger$. ∎

Corollary A.2.
$$M M^\dagger = U \Sigma V^\top V \Sigma^\dagger U^\top = U \Sigma \Sigma^\dagger U^\top = U \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix} U^\top = \sum_{i=1}^{r} U_{[:,i]} U_{[:,i]}^\top,$$
where $\Sigma \Sigma^\dagger \in \mathbb{R}^{m \times m}$ has the $r \times r$ identity block $\operatorname{diag}(\sigma_1, \dots, \sigma_r)\operatorname{diag}(1/\sigma_1, \dots, 1/\sigma_r) = I_r$ and zeros elsewhere. That is, $M M^\dagger$ is the orthogonal projector onto $\operatorname{Im}(M)$.

Proposition A.3. Equivalence of writing $G^\dagger \nabla_\theta \mathcal{L}(\theta)$ as a product of summed matrices vs. as stacked matrix–vector products, i.e.,
$$\left( \frac{1}{d} \sum_{i=1}^{d} J_i^\top \nabla^2_{f_\theta} \ell_i \, J_i \right)^\dagger \left( \frac{1}{d} \sum_{i=1}^{d} J_i^\top \nabla_{f_\theta} \ell_i \right) = \left( \nabla^2_{f_\theta} \ell(f_\theta)^{1/2} J \right)^\dagger \nabla^2_{f_\theta} \ell(f_\theta)^{-1/2} \nabla_{f_\theta} \ell(f_\theta).$$
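The pseudoinverse identity of Lemma A.1 and the projector form of Corollary A.2 can be checked numerically. The following is a NumPy sketch (not from the paper) on a random rank-deficient matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 4, 2
# Random rank-r matrix M in R^{m x n}
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

Mpinv = np.linalg.pinv(M)

# Lemma A.1: (M^T M)^† M^T = M^†
lhs = np.linalg.pinv(M.T @ M) @ M.T
assert np.allclose(lhs, Mpinv)

# Corollary A.2: M M^† = sum of u_i u_i^T over the top-r left singular
# vectors, i.e. the orthogonal projector onto Im(M)
U, s, Vt = np.linalg.svd(M)
P = sum(np.outer(U[:, i], U[:, i]) for i in range(r))
assert np.allclose(M @ Mpinv, P)
```

The check uses `np.linalg.pinv`, whose default singular-value cutoff discards the $m - r$ zero singular values, matching the block structure in the proof.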

Writing the SVD $\nabla^2_{f_\theta} \ell(f_\theta)^{1/2} J = U \Sigma V^\top$ and applying Corollary A.2, the whitened error dynamics become
$$U^\top \nabla^2_{f_\theta} \ell(f_\theta)^{1/2} \frac{d f_\theta}{d\tau} = - \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix} U^\top \, \frac{1}{d} \underbrace{\begin{pmatrix} (\nabla^2_{f_\theta} \ell_1)^{-1/2} \nabla_{f_\theta} \ell_1 \\ \vdots \\ (\nabla^2_{f_\theta} \ell_d)^{-1/2} \nabla_{f_\theta} \ell_d \end{pmatrix}}_{dk \times 1},$$
so in the whitened coordinates the error velocity is the projection of the stacked per-sample whitened gradients onto the first $r$ coordinates. For comparison, the function-space Newton direction $\nabla^2_{f_\theta} \ell(f_\theta)^{-1} \nabla_{f_\theta} \ell(f_\theta)$ decomposes samplewise into the per-sample directions $(\nabla^2_{f_\theta} \ell_i)^{-1} \nabla_{f_\theta} \ell_i$.
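Proposition A.3 can likewise be verified numerically. The sketch below (not from the paper; dimensions and the symmetric-square-root helper are illustrative choices) compares the summed form $G^\dagger \nabla_\theta \mathcal{L}$ against the stacked, whitened form, whose $1/\sqrt{d}$ factors cancel under the pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, p = 3, 2, 5   # d samples, k outputs each, p parameters

def sqrtm_spd(H):
    # Symmetric square root of an SPD matrix via eigendecomposition
    w, V = np.linalg.eigh(H)
    return (V * np.sqrt(w)) @ V.T

J = [rng.standard_normal((k, p)) for _ in range(d)]   # per-sample Jacobians J_i
H = [np.eye(k) + B @ B.T
     for B in (rng.standard_normal((k, k)) for _ in range(d))]  # SPD Hessians
g = [rng.standard_normal(k) for _ in range(d)]        # per-sample gradients

# Summed form: G^† ∇_θ L
G = sum(J[i].T @ H[i] @ J[i] for i in range(d)) / d
grad = sum(J[i].T @ g[i] for i in range(d)) / d
lhs = np.linalg.pinv(G) @ grad

# Stacked, whitened form: [H_i^{1/2} J_i]^† [H_i^{-1/2} g_i]
M = np.vstack([sqrtm_spd(H[i]) @ J[i] for i in range(d)])           # dk x p
h = np.concatenate([np.linalg.solve(sqrtm_spd(H[i]), g[i]) for i in range(d)])
rhs = np.linalg.pinv(M) @ h

assert np.allclose(lhs, rhs)
```

The equality follows from Lemma A.1 applied to $M = \frac{1}{\sqrt{d}}\,[\,H_i^{1/2} J_i\,]$ stacked over samples: $(M^\top M)^\dagger M^\top = M^\dagger$.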

This ratio tells us how much of $v$ lies in $\operatorname{Im}(J)$: it is the cosine of the angle between $v$ and the subspace $\operatorname{Im}(J)$. If the ratio is 1, then all of $v$ is in the subspace; if it is 0, then none of $v$ is. A similar computation yields the reachability of the GGN. Here, the computation is carried out in the norm weighted by $H_\ell$:
$$\frac{\left\| \frac{1}{d}\, J \left( J^\top H_\ell J \right)^\dagger J^\top v \right\|^2_{H_\ell}}{\left\| v \right\|^2_{H_\ell}},$$
which again expands into a sum over the samples $i$.
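The unweighted reachability ratio behaves exactly as described. A small NumPy sketch (illustrative, not from the paper; the `reachability` helper is hypothetical) checks the two extreme cases:

```python
import numpy as np

rng = np.random.default_rng(1)
dk, p = 6, 3
J = rng.standard_normal((dk, p))   # tall Jacobian: Im(J) is a p-dim subspace
P = J @ np.linalg.pinv(J)          # orthogonal projector onto Im(J) (Cor. A.2)

def reachability(v):
    # ||J J^† v|| / ||v||: cosine of the angle between v and Im(J)
    return np.linalg.norm(P @ v) / np.linalg.norm(v)

v_in = J @ rng.standard_normal(p)  # lies entirely in Im(J)
assert np.isclose(reachability(v_in), 1.0)

v_any = rng.standard_normal(dk)
v_perp = v_any - P @ v_any         # component orthogonal to Im(J)
assert np.isclose(reachability(v_perp), 0.0, atol=1e-8)
```

For a generic $v$ the ratio falls strictly between 0 and 1, interpolating between the two cases above.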

Gauss–Newton descent corresponds to the Newton update direction in function space, restricted to directions reachable through parameter updates. Moreover, this direction is the unique minimizer of
$$\min_{v \in \operatorname{Im}(\nabla_\theta f_\theta)} \; \frac{1}{2} \left\| v + H_\ell^{-1} \nabla_{f_\theta} \ell(f_\theta) \right\|^2_{H_\ell}.$$

Proof. Let $J_i := \nabla_\theta f_\theta(x_i)$ and let $J = [J_1, \dots, J_d]^\top \in \mathbb{R}^{dk \times p}$ denote the vectorized matrix containing each sample Jacobian. Note that the vectorized Hessian $\nabla^2_{f_\theta} \ell(f_\theta) \in \mathbb{R}^{dk \times dk}$ is block diagonal:
$$\nabla^2_{f_\theta} \ell(f_\theta) = \frac{1}{d} \begin{pmatrix} \nabla^2_{f_\theta} \ell_1 & 0 & \cdots & 0 \\ 0 & \nabla^2_{f_\theta} \ell_2 & & \vdots \\ \vdots & & \ddots & \\ 0 & \cdots & & \nabla^2_{f_\theta} \ell_d \end{pmatrix}, \qquad \nabla^2_{f_\theta} \ell_i \in \mathbb{R}^{k \times k}.$$
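The variational characterization can also be tested numerically. The NumPy sketch below (not from the paper; it drops the $1/d$ averaging, which cancels inside the pseudoinverse) checks that the function-space direction induced by the GGN parameter step coincides with the minimizer of $\frac{1}{2}\|v + H_\ell^{-1} g\|^2_{H_\ell}$ over $v \in \operatorname{Im}(J)$, obtained by weighted least squares:

```python
import numpy as np

rng = np.random.default_rng(2)
dk, p = 6, 3
J = rng.standard_normal((dk, p))          # stacked Jacobian, full column rank
B = rng.standard_normal((dk, dk))
H = B @ B.T + np.eye(dk)                  # SPD function-space loss Hessian H_ell
g = rng.standard_normal(dk)               # function-space loss gradient

# Symmetric square root of H via eigendecomposition
w, V = np.linalg.eigh(H)
H_half = (V * np.sqrt(w)) @ V.T

# GGN parameter step and the function-space direction it induces
u_ggn = -np.linalg.pinv(J.T @ H @ J) @ (J.T @ g)
v_ggn = J @ u_ggn

# Minimizer of (1/2) ||v + H^{-1} g||_H^2 over v in Im(J): substituting
# v = J u turns it into the least-squares problem  H^{1/2} J u ≈ -H^{-1/2} g
u_ls, *_ = np.linalg.lstsq(H_half @ J, -np.linalg.solve(H_half, g), rcond=None)
assert np.allclose(v_ggn, J @ u_ls)
```

The agreement is an instance of Lemma A.1 with $M = H_\ell^{1/2} J$: the GGN step $-(J^\top H_\ell J)^\dagger J^\top g$ equals the min-norm least-squares solution $-M^\dagger H_\ell^{-1/2} g$.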