pith. machine review for the scientific record.

arxiv: 2605.12801 · v1 · submitted 2026-05-12 · 🧮 math.NA · cs.NA

Recognition: 2 theorem links · Lean Theorem

Fast and Stable Gradient Approximation for Bilinear Forms of Hermitian Matrix Functions

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:28 UTC · model grok-4.3

classification 🧮 math.NA cs.NA
keywords Lanczos method · matrix functions · gradient approximation · bilinear forms · Hermitian matrices · numerical linear algebra · Krylov methods · automatic differentiation

The pith

A forward-only gradient approximation reuses the Lanczos pass to stably differentiate bilinear forms u^T f(A(θ))v for Hermitian A with error proportional to the residual norm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of differentiating bilinear forms involving matrix functions of large Hermitian matrices, which appear in scientific computing and probabilistic machine learning. Lanczos efficiently approximates the forward value, but prior differentiation methods either require unstable adjoint passes or extra orthogonalization on augmented systems. The new forward-only method reuses the existing Lanczos vectors and eigenvalues to approximate the gradient, adding minimal overhead. The paper proves that the gradient error scales directly with the Lanczos residual norm, reports unconditional stability in tests without reorthogonalization, and finds the method faster than existing approaches.
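For orientation, the standard Lanczos machinery this summary refers to can be written out explicitly (textbook material, not quoted from the paper): $k$ steps of Lanczos on Hermitian $A$ with starting vector $q_1 = u/\|u\|$ produce $A Q_k = Q_k T_k + \beta_k q_{k+1} e_k^\top$ with real symmetric tridiagonal $T_k$, and the bilinear form is approximated as $u^\top f(A)\,u \approx \|u\|^2\, e_1^\top f(T_k)\, e_1$ in the symmetric case $u = v$ (the case $u \neq v$ is commonly handled by polarization or a block variant).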

Core claim

We propose a forward-only gradient approximation that reuses the Lanczos pass and adds very minimal overhead in most cases. We prove that its error is proportional to the Lanczos residual norm, the same quantity controlling the forward approximation. Whereas a traditional adjoint-based calculation would be unstable without reorthogonalization, the new method appears unconditionally stable in our tests. It is also faster than existing state-of-the-art approaches.

What carries the argument

Forward-only gradient approximation that reuses Lanczos vectors and eigenvalues from the forward pass to compute the derivative of the bilinear form without additional orthogonalization or block augmentation.
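The review describes this mechanism only in words. Below is a minimal sketch of one plausible realization, assuming the symmetric case $u = v$, real symmetric $A(\theta)$, and a projected-sensitivity gradient $\|u\|^2\, e_1^\top Df(T_k)[Q_k^\top (\partial_\theta A) Q_k]\, e_1$ evaluated with the Daleckii–Krein formula on the Ritz decomposition of $T_k$. The function names, the specific gradient formula, and the residual indicator are illustrative reconstructions, not formulas quoted from the paper.

```python
import numpy as np

def lanczos(matvec, u, k):
    """k-step Lanczos without reorthogonalization:
    A Q = Q T + beta_k q_{k+1} e_k^T (assumes no breakdown, i.e. b > 0)."""
    n = u.shape[0]
    Q = np.zeros((n, k))
    alpha, beta = np.zeros(k), np.zeros(k)
    q, q_prev, b = u / np.linalg.norm(u), np.zeros(n), 0.0
    for j in range(k):
        Q[:, j] = q
        w = matvec(q) - b * q_prev
        alpha[j] = q @ w
        w -= alpha[j] * q
        b = np.linalg.norm(w)
        beta[j] = b
        q_prev, q = q, w / b
    T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
    return Q, T, beta[-1]

def divided_differences(f, df, lam, tol=1e-12):
    """Daleckii-Krein table G[i,j] = (f(l_i)-f(l_j))/(l_i-l_j), with f'(l_i)
    on (and numerically near) the diagonal."""
    d = lam[:, None] - lam[None, :]
    close = np.abs(d) <= tol
    return np.where(close, df(lam)[:, None] + 0.0 * d,
                    (f(lam)[:, None] - f(lam)[None, :]) / np.where(close, 1.0, d))

def value_grad_resid(matvec, dmatvec, u, k, f, df):
    """One Lanczos pass gives the value u^T f(A) u, a forward-only gradient
    sketch, and a residual-style error indicator; dmatvec applies dA/dtheta."""
    Q, T, beta_k = lanczos(matvec, u, k)
    lam, V = np.linalg.eigh(T)                 # Ritz values/vectors, reused below
    v1, nrm2 = V[0, :], u @ u                  # e_1^T V and ||u||^2
    value = nrm2 * (v1 ** 2) @ f(lam)
    E = Q.T @ np.column_stack([dmatvec(Q[:, j]) for j in range(k)])  # Q^T (dA) Q
    grad = nrm2 * v1 @ (divided_differences(f, df, lam) * (V.T @ E @ V)) @ v1
    resid = nrm2 * beta_k * abs((V[-1, :] * f(lam)) @ v1)  # common f(A)b-style indicator
    return value, grad, resid
```

In this reconstruction, the single extra pass of $\partial_\theta A$ over the existing basis is where the claimed "minimal overhead" would come from.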

If this is right

  • The same residual-norm tolerance can be used to control accuracy for both the bilinear-form value and its gradient (see the sketch after this list).
  • No reorthogonalization is required, avoiding the extra computation and memory traffic of adjoint or augmented-matrix methods.
  • The approach reduces runtime compared with backpropagation through the Lanczos recurrence or Arnoldi on a doubled-size block matrix.
  • It applies directly to large-scale problems where bilinear forms of matrix functions arise in optimization and inference.
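A hedged illustration of the first bullet above, reusing the hypothetical value_grad_resid helper from the earlier sketch: grow the Krylov depth until a single residual tolerance is met, then accept both outputs. The depth schedule and tolerance are arbitrary choices for illustration.

```python
def run_to_tolerance(matvec, dmatvec, u, f, df, tol=1e-8, k_max=200):
    """One tolerance drives both the value and the gradient, which is
    exactly what the claimed error proportionality would license."""
    for k in range(10, k_max + 1, 10):
        value, grad, resid = value_grad_resid(matvec, dmatvec, u, k, f, df)
        if resid < tol:
            break
    return value, grad, k
```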

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the observed stability generalizes, the method could replace custom adjoint implementations in libraries for differentiable linear algebra.
  • Similar reuse of residual information might extend to other Krylov solvers when the matrix is non-Hermitian.
  • The result suggests residual norms could serve as practical error indicators for both primal and sensitivity quantities in iterative matrix computations.

Load-bearing premise

The method assumes A(θ) remains Hermitian for all parameters so Lanczos produces real eigenvalues and orthogonal vectors, and that the forward residual norm reliably bounds the gradient error as well.

What would settle it

Numerical tests on Hermitian matrices in which the observed gradient error grows substantially faster than the Lanczos residual norm under the forward-only method would disprove the error-proportionality claim.
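That test is mechanical enough to sketch. Under the same illustrative assumptions as the code above, compare the forward-only gradient against a dense reference gradient (Daleckii–Krein on the full eigendecomposition of $A(\theta)$) and watch the ratio of gradient error to residual indicator; unbounded growth of that ratio across problem instances would falsify proportionality. The matrix sizes, test function, and random ensemble here are arbitrary, not the paper's benchmarks.

```python
import numpy as np

def dense_reference_grad(A, dA, u, f, df):
    """Exact d/dtheta of u^T f(A) u via Daleckii-Krein on the full spectrum."""
    lam, V = np.linalg.eigh(A)
    G = divided_differences(f, df, lam)        # from the earlier sketch
    w = V.T @ u
    return w @ (G * (V.T @ dA @ V)) @ w

rng = np.random.default_rng(0)
n, k = 300, 30
f, df = np.exp, np.exp
for trial in range(10):
    B = rng.standard_normal((n, n)); A = (B + B.T) / 2    # random symmetric A(theta)
    C = rng.standard_normal((n, n)); dA = (C + C.T) / 2   # direction dA/dtheta
    u = rng.standard_normal(n)
    _, grad, resid = value_grad_resid(lambda x: A @ x, lambda x: dA @ x, u, k, f, df)
    err = abs(grad - dense_reference_grad(A, dA, u, f, df))
    print(f"trial {trial}: err/resid = {err / resid:.2e}")  # unbounded growth falsifies
```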

Figures

Figures reproduced from arXiv: 2605.12801 by Kipton Barros, Navjot Singh, Xiaoye Sherry Li.

Figure 1: Comparison of our method with the adjoint-based method …
Figure 2: GP training on the Protein dataset with a Matérn-…
Figure 3: Speedup of the proposed method over modified Arnoldi on graph-sensitivity benchmarks.
Figure 4: Relative gradient error versus Krylov depth.
Figure 5: Relative value error versus Krylov depth.
Figure 6: Hamiltonian-learning convergence as a function of optimization step. Larger Krylov depths …
Original abstract

Objectives involving bilinear forms $u^\top f(A(\theta))v$ for Hermitian $A$ arise widely in scientific computing and probabilistic machine learning. For large matrices, Lanczos efficiently approximates these quantities, but differentiating them with respect to $\theta$ is challenging. Existing approaches either backpropagate through the Lanczos recurrence, requiring reorthogonalization for stability, or apply Arnoldi to an augmented block matrix of twice the original size. Both introduce extra computation and orthogonalization costs that can limit performance on modern hardware. We propose a forward-only gradient approximation that reuses the Lanczos pass and adds very minimal overhead in most cases. We prove that its error is proportional to the Lanczos residual norm, the same quantity controlling the forward approximation. Whereas a traditional adjoint-based calculation would be unstable without reorthogonalization, the new method appears unconditionally stable in our tests. It is also faster than existing state-of-the-art approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a forward-only gradient approximation for bilinear forms u^T f(A(θ)) v with Hermitian A(θ), reusing the Lanczos pass with minimal overhead. It proves the gradient error is proportional to the Lanczos residual norm and reports that the method appears unconditionally stable in numerical tests, offering advantages over adjoint-based methods that require reorthogonalization.

Significance. If the unconditional stability holds rigorously, this approach could provide a more efficient and stable way to compute gradients in large-scale applications involving matrix functions, such as in probabilistic machine learning and scientific computing. The reuse of the Lanczos pass and the error bound tied to the residual norm are notable strengths, as is the parameter-free nature of the error control.

major comments (2)
  1. [Abstract] Abstract: The assertion that the new method 'appears unconditionally stable in our tests' lacks a supporting theorem, characterization of the admissible class of Hermitian A(θ) and functions f, or analysis of potential failure modes such as clustered spectra or large ||dA/dθ||; this empirical claim is load-bearing for the practical advantage over adjoint methods.
  2. [Proof section] Proof of error bound (likely §3): While the proportionality of gradient error to the Lanczos residual norm is established, the manuscript does not extend the analysis to guarantee stability without reorthogonalization, leaving the central stability advantage as an observation rather than a derived result.
minor comments (2)
  1. [Introduction] Ensure all assumptions on f (e.g., analyticity) are stated explicitly in the introduction and restated in the stability discussion.
  2. [Numerical experiments] Figure captions for numerical tests should include the range of matrix sizes, spectrum types, and function classes tested to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the forward-only Lanczos gradient approximation. We address the two major comments below, agreeing where the manuscript relies on empirical evidence rather than theorems.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion that the new method 'appears unconditionally stable in our tests' lacks a supporting theorem, characterization of the admissible class of Hermitian A(θ) and functions f, or analysis of potential failure modes such as clustered spectra or large ||dA/dθ||; this empirical claim is load-bearing for the practical advantage over adjoint methods.

    Authors: We agree that the stability statement in the abstract is empirical. In the revision we will rephrase the abstract to read 'numerical experiments suggest that the method remains stable without reorthogonalization' and expand the numerical section with additional experiments on clustered spectra, large ||dA/dθ||, and a wider range of f. A complete theoretical characterization of admissible (A(θ),f) pairs is beyond the current scope but we will add a short discussion of observed failure modes. revision: partial

  2. Referee: [Proof section] Proof of error bound (likely §3): While the proportionality of gradient error to the Lanczos residual norm is established, the manuscript does not extend the analysis to guarantee stability without reorthogonalization, leaving the central stability advantage as an observation rather than a derived result.

    Authors: The referee correctly notes that the error bound is proven while unconditional stability is only observed. We will revise the relevant section to explicitly label the stability claim as an empirical observation, include a brief heuristic argument based on the forward-only structure, and report additional numerical diagnostics. A rigorous proof of stability for arbitrary Hermitian A(θ) and f remains an open question. revision: partial
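One concrete diagnostic of the kind promised here, sketched under the same assumptions as the code above: track the loss of orthogonality $\max_{ij} |(Q_k^\top Q_k - I)_{ij}|$ in the plain Lanczos basis, the classical symptom (Paige) behind the instability of adjoint passes. The depths and random ensemble are illustrative.

```python
import numpy as np

def orthogonality_loss(matvec, u, depths):
    """Print the worst entry of Q^T Q - I for increasing Krylov depths,
    reusing the lanczos() sketch above."""
    for k in depths:
        Q, _, _ = lanczos(matvec, u, k)
        print(f"k={k:3d}  max|Q^T Q - I| = {np.abs(Q.T @ Q - np.eye(k)).max():.2e}")

rng = np.random.default_rng(1)
n = 500
B = rng.standard_normal((n, n)); A = (B + B.T) / 2
orthogonality_loss(lambda x: A @ x, rng.standard_normal(n), [10, 20, 40, 80])
```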

standing simulated objections not resolved
  • Rigorous theorem guaranteeing unconditional stability without reorthogonalization for general Hermitian A(θ) and functions f

Circularity Check

0 steps flagged

No circularity; error bound and stability claim rest on independent residual norm and empirical tests

full rationale

The derivation proves gradient error proportional to the independently computed Lanczos residual norm (same controller as the forward bilinear-form error) without reducing to a fitted parameter, self-definition, or self-citation chain. No equations rename a known result or smuggle an ansatz via prior work by the same authors. The stability assertion is presented as an empirical observation from tests rather than a theorem derived from the method's own inputs, so the central claims remain self-contained and externally falsifiable via residual norms.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard properties of the Lanczos algorithm for Hermitian matrices and the interpretation of the residual norm as an error indicator; no free parameters are introduced and no new entities are postulated.

axioms (2)
  • domain assumption A(θ) is Hermitian for the relevant range of θ
    Required for Lanczos to generate real Ritz values and orthogonal basis vectors; invoked implicitly throughout the Lanczos-based approximation.
  • standard math The Lanczos residual norm controls the forward approximation error for the bilinear form
    Standard convergence property of the Lanczos method on Hermitian matrices; the paper extends this control to the gradient error.
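To make these axioms concrete (standard material, not quoted from the paper): a widely used residual-style error indicator for Lanczos approximations of $f(A)b$-type quantities is $r_k = \beta_k\,|e_k^\top f(T_k)\, e_1|$, computable from the tridiagonal factor alone; and the Hermiticity axiom is what guarantees real Ritz values $\lambda_i$, so that divided differences $(f(\lambda_i)-f(\lambda_j))/(\lambda_i-\lambda_j)$ in a Daleckii–Krein-style gradient are taken along the real line where $f$ is assumed smooth. Whether the paper's residual norm is exactly $r_k$ is not quoted here.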

pith-pipeline@v0.9.0 · 5460 in / 1534 out tokens · 42079 ms · 2026-05-14T19:28:58.810999+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

  1. [1] Awad H. Al-Mohy and Nicholas J. Higham. Computing the Fréchet derivative of the matrix exponential, with an application to condition number estimation. SIAM Journal on Matrix Analysis and Applications, 30(4):1639–1657, 2009. doi: 10.1137/080716426
  2. [2] Leonardo Banchi et al. Learning quantum systems. Nature Reviews Physics, 5:141–156, 2023
  3. [3] Bernhard Beckermann and Lothar Reichel. Error estimates and evaluation of matrix functions via the Faber transform. SIAM Journal on Numerical Analysis, 47(5):3849–3883, 2009. doi: 10.1137/080741744
  4. [4] Michele Benzi and Paola Boito. Matrix functions in network analysis. GAMM-Mitteilungen, 43(3):e202000012, 2020
  5. [5] Tyler Chen and Eric Hallman. Krylov-aware stochastic trace estimation. SIAM Journal on Matrix Analysis and Applications, 44(3):1218–1244, 2023. doi: 10.1137/22M1526357
  6. [6] Michel Crouzeix and Daniel Kressner. A bivariate extension of the Crouzeix–Palencia result with an application to Fréchet derivatives of matrix functions. arXiv preprint arXiv:2007.09784, 2020
  7. [7] Alexander Davies and Nicholas J. Higham. Computing f(A)b for matrix functions f. SIAM Journal on Matrix Analysis and Applications, 42(2):1084–1105, 2021. doi: 10.1137/20M1363868
  8. [8] Omar De la Cruz Cabrera, Jiafeng Jin, Silvia Noschese, and Lothar Reichel. Communication in complex networks. Applied Numerical Mathematics, 172:186–205, 2022
  9. [9] Kun Dong, David Eriksson, Hannes Nickisch, David Bindel, and Andrew Gordon Wilson. Scalable log determinants for Gaussian process kernel learning. In Advances in Neural Information Processing Systems, 2017
  10. [10] Howard C. Elman, David J. Silvester, and Andrew J. Wathen. Finite Elements and Fast Iterative Solvers: With Applications in Incompressible Fluid Dynamics. Oxford University Press, 2nd edition, 2014
  11. [11] Efstratios Gallopoulos and Yousef Saad. Efficient solution of parabolic equations by Krylov approximation methods. SIAM Journal on Scientific and Statistical Computing, 13(5):1236–1264, 1992. doi: 10.1137/0913071
  12. [12] Jacob Gardner, Geoff Pleiss, Kilian Q. Weinberger, David Bindel, and Andrew Gordon Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, 2018
  13. [13] Gene H. Golub and Gérard Meurant. Matrices, Moments and Quadrature with Applications. Princeton University Press, 2009
  14. [14] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 4th edition, 2013
  15. [15] Jeongwan Haah et al. Practical Hamiltonian learning with unitary dynamics and Gibbs states. Nature Communications, 2024
  16. [16] Nicholas J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2008. doi: 10.1137/1.9780898717778
  17. [17] Marlis Hochbruck and Christian Lubich. On Krylov subspace approximations to the matrix exponential operator. SIAM Journal on Numerical Analysis, 34(5):1911–1925, 1997. doi: 10.1137/S0036142995280572
  18. [18] Hsin-Yuan Huang et al. Robust and efficient Hamiltonian learning. Quantum, 7:1045, 2023
  19. [19] Michael F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics – Simulation and Computation, 18(3):1059–1076, 1989. doi: 10.1080/03610918908812806
  20. [20] Alexander Immer, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Mohammad Emtiyaz Khan. Scalable marginal likelihood estimation for model selection in deep learning. In International Conference on Machine Learning, 2021
  21. [21] Peter Kandolf, Antti Koskela, Samuel D. Relton, and Marcel Schweitzer. Computing low-rank approximations of the Fréchet derivative of a matrix function using Krylov subspace methods. SIAM Journal on Matrix Analysis and Applications, 44(3):1115–1139, 2023. doi: 10.1137/22M1503075
  22. [22] Nicholas Krämer, Pablo Moreno-Muñoz, Hrittik Roy, and Søren Hauberg. Gradients of functions of large matrices. In Advances in Neural Information Processing Systems, volume 37, 2024
  23. [23] Daniel Kressner. A Krylov subspace method for the approximation of bivariate matrix functions. In Structured Matrices in Numerical Linear Algebra, volume 30 of Springer INdAM Series, pages 197–214. Springer, Cham, 2019. doi: 10.1007/978-3-030-04088-8_10
  24. [24] Daniel Kressner and Peter Oehme. A novel Krylov subspace method for approximating Fréchet derivatives of large-scale matrix functions. arXiv preprint arXiv:2601.21799, 2026
  25. [25] James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning, pages 735–742, 2010
  26. [26] Gérard Meurant. The Lanczos and Conjugate Gradient Algorithms: From Theory to Finite Precision Computations. SIAM, 2006
  27. [28] Igor Najfeld and Timothy F. Havel. Derivatives of the matrix exponential and their computation. Advances in Applied Mathematics, 16(3):321–375, 1995. doi: 10.1006/aama.1995.1017
  28. [29] Christopher C. Paige. Computational variants of the Lanczos method for the eigenproblem. IMA Journal of Applied Mathematics, 10(3):373–381, 1972
  29. [30] Christopher C. Paige. Error analysis of the Lanczos algorithm for tridiagonalizing a symmetric matrix. IMA Journal of Applied Mathematics, 18(3):341–349, 1976
  30. [31] Christopher Conway Paige. The Computation of Eigenvalues and Eigenvectors of Very Large Sparse Matrices. PhD thesis, University of London, 1971
  31. [32] Beresford N. Parlett. The Symmetric Eigenvalue Problem. SIAM, 1998
  32. [33] Beresford N. Parlett and David S. Scott. The Lanczos algorithm with selective orthogonalization. Mathematics of Computation, 33(145):217–238, 1979
  33. [34] Stefano Pozza and Francesco Tudisco. On the stability of network indices defined by means of matrix functions. SIAM Journal on Matrix Analysis and Applications, 39(4):1521–1546, 2018. doi: 10.1137/18M1172319
  34. [35] Prashant Rana. Physicochemical properties of protein tertiary structure. UCI Machine Learning Repository, 2013. URL https://doi.org/10.24432/C5QW3H
  35. [36] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006
  36. [37] Håvard Rue and Leonhard Held. Gaussian Markov Random Fields: Theory and Applications. Chapman and Hall/CRC, 2005. doi: 10.1201/9780203492024
  37. [38] Yousef Saad. Analysis of some Krylov subspace approximations to the matrix exponential operator. SIAM Journal on Numerical Analysis, 29(1):209–228, 1992. doi: 10.1137/0729014
  38. [39] Yousef Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2nd edition, 2003. doi: 10.1137/1.9780898718003
  39. [40] Nicol N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002. doi: 10.1162/08997660260028683
  40. [41] Marcel Schweitzer. Sensitivity of matrix function based network communicability measures: Computational methods and a priori bounds. SIAM Journal on Matrix Analysis and Applications, 44(3):1321–1348, 2023
  41. [42] Horst D. Simon. Analysis of the symmetric Lanczos algorithm with reorthogonalization methods. Linear Algebra and its Applications, 61:101–131, 1984
  42. [43] Shashanka Ubaru, Jie Chen, and Yousef Saad. Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017. doi: 10.1137/16M1104974