Inversion-Free Natural Gradient Descent on Riemannian Manifolds

Dario Draca; Minh-Ngoc Tran; Takuo Matsubara

arxiv: 2604.02969 · v2 · pith:2XZDIIJNnew · submitted 2026-04-03 · 📊 stat.ML · cs.LG· stat.CO· stat.ME

Inversion-Free Natural Gradient Descent on Riemannian Manifolds

Dario Draca , Takuo Matsubara , Minh-Ngoc Tran This is my paper

Pith reviewed 2026-05-13 18:12 UTC · model grok-4.3

classification 📊 stat.ML cs.LGstat.COstat.ME

keywords natural gradientRiemannian manifoldstochastic optimizationFisher information matrixonline approximationconvergence ratevariational Bayesnormalizing flows

0 comments

The pith

An inversion-free natural gradient method on Riemannian manifolds updates an online approximation to the inverse Fisher information matrix using transported score vectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a stochastic natural gradient descent algorithm that works directly on Riemannian manifolds without needing to invert the Fisher information matrix at each step. It maintains a low-cost online estimate of the inverse FIM by incorporating score vectors from current and past iterates after transporting them to a common tangent space. This yields almost-sure convergence of the squared distance to the optimum at rate O(log s / s^α) provided the step-size schedule has exponent α exceeding 2/3, while the approximation error in the FIM accumulates from the transport steps. A limited-memory version is also given to control storage, and the approach is tested on variational Bayes problems with Gaussian and flow-based approximations where manifold geometry enforces constraints naturally.

Core claim

The central discovery is an inversion-free update for the approximate inverse FIM on the manifold that combines transported score vectors at quadratic cost per iteration and delivers almost-sure convergence rates O(log s / s^α) for the parameter distance when α > 2/3, together with corresponding rates for the FIM approximation itself.

What carries the argument

The transport-based quadratic update rule for the running inverse-FIM approximation, which moves score vectors between successive tangent spaces to avoid explicit inversion.

If this is right

The squared distance to the minimizer converges almost surely at O(log s / s^α) for step-size exponents α > 2/3.
The approximate FIM converges at almost-sure rates despite the added transport errors.
The limited-memory variant achieves the same rates with sub-quadratic storage.
The method applies directly to constrained problems such as positive-definite covariance estimation and orthogonal parameters in normalizing flows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar transport-based updates could be applied to other online matrix approximations on manifolds beyond the FIM.
The method may allow natural extension to non-Euclidean parameter spaces in deep learning models that already use Riemannian optimization.
Choosing manifolds with closed-form transport maps would minimize the error accumulation shown in the rates.
Testing on higher-dimensional manifolds could reveal whether the quadratic cost remains practical at scale.

Load-bearing premise

The transport operations between tangent spaces must be accurate enough that their accumulated errors do not invalidate the convergence rates, and the manifold must be regular enough for the stochastic approximation to converge.

What would settle it

A numerical counter-example on a simple manifold such as the positive-definite cone where large transport approximation error causes the iterates to diverge or fail to achieve the predicted rate.

Figures

Figures reproduced from arXiv: 2604.02969 by Dario Draca, Minh-Ngoc Tran, Takuo Matsubara.

**Figure 2.** Figure 2: Comparison of approximate/exact natural gradient algorithms on larger datasets. [PITH_FULL_IMAGE:figures/full_fig_p022_2.png] view at source ↗

**Figure 3.** Figure 3: Negative lower bound plots of the three VB methods [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗

read the original abstract

The natural gradient method is a central tool for statistical optimisation, but its broader application is hindered by the assumption of a Euclidean parameter space, the repeated estimation of the Fisher information matrix (FIM), and the computational cost of its subsequent inversion. This paper proposes an intrinsic, inversion-free natural gradient method for statistical models whose parameters lie on general Riemannian manifolds. Formulating statistical optimisation in this non-Euclidean setting allows for the natural enforcement of parameter constraints, the elimination of non-identifiable parameters, and the exploitation of geodesic convexity. Our algorithm is based on a moving approximation of the inverse FIM, which is maintained directly on the manifold. This approximation is efficiently updated with new score vectors using low-rank matrix identities. We prove almost-sure convergence rates of $O(\log s / s^\alpha)$ for the sequence of iterates, and a similar rate for the approximate FIM. A limited-memory variant with sub-quadratic storage complexity is further proposed for large-scale applications. We demonstrate the efficacy of our method on variational Bayes within the Bures-Wasserstein manifold, normalising flows on the Stiefel manifold, and reduced-rank logistic regression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical inversion-free natural gradient on Riemannian manifolds with almost-sure rates, but the transport error bounds are the part that needs the closest check.

read the letter

Hi, the core contribution here is a stochastic natural gradient update that skips matrix inversion on a manifold by maintaining an online quadratic-cost approximation to the inverse Fisher information matrix. Score vectors from different tangent spaces get combined via transport, and the authors prove almost-sure rates of O(log s / s^α) for squared distance to the minimizer when α > 2/3, plus rates for the approximate FIM itself that explicitly include the transport errors. A limited-memory version is also given, and they test it on variational Bayes with Gaussian approximations and normalizing flows, where it beats the Euclidean baseline in the reported runs. The manifold framing is useful because it lets constraints like positive definiteness be enforced automatically without extra penalties or projections. The online update and the explicit rates are the parts that feel new relative to standard Euclidean natural gradient work. The transport step is handled correctly in principle by using parallel transport, and the paper is upfront that the FIM approximation now carries accumulated transport errors. The main soft spot is whether those errors stay controlled enough for the claimed rates to survive, especially near α = 2/3. If each transport discrepancy is order of the step size, the summed error is order s^{1-α}, and the supermartingale argument has to absorb it without losing the rate; the abstract says the rates hold anyway, so the proof presumably closes the bound, but that is the section I would read first. Geodesic convexity is also assumed, which is standard but restricts the settings where the result applies directly. This is for people doing stochastic optimization or variational inference on constrained parameter spaces. It has enough algorithmic novelty and theoretical content to deserve a serious referee, even if the error analysis will probably need tightening in revision. I would send it to peer review.

Referee Report

1 major / 2 minor

Summary. The paper introduces an inversion-free stochastic natural gradient descent algorithm for optimization over Riemannian manifolds, particularly for probability distributions. It maintains an online approximation to the inverse Fisher information matrix by updating with score vectors that are parallel-transported between tangent spaces at successive iterates. The central theoretical claims are almost-sure convergence rates of O(log s / s^α) for the squared distance to the minimizer when the step-size exponent α > 2/3, together with almost-sure rates for the approximate FIM that explicitly accumulate transport-based errors. A limited-memory variant with sub-quadratic storage is proposed, and the method is evaluated on variational Bayes tasks using Gaussian approximations and normalizing flows.

Significance. If the claimed rates survive the accumulated transport errors, the work would be a useful extension of natural-gradient methods to manifold-constrained settings that arise in statistical machine learning. The inversion-free online update, explicit handling of transport operations, and limited-memory implementation address both computational and geometric constraints, while the almost-sure rates provide stronger guarantees than typical in-expectation analyses.

major comments (1)

[Convergence analysis] Convergence analysis (proof of the almost-sure rate for squared distance): the argument must show that the summed transport discrepancy, which is O(s^{1-α}) when each parallel-transport error is O(‖x_{s+1}-x_s‖), remains dominated by the driving noise and step-size terms in the supermartingale inequality for all α > 2/3. An explicit bound on the per-step transport error (or its expectation) and its insertion into the distance recursion is required; without it the rate claim for α close to 2/3 is not yet verified.

minor comments (2)

[Abstract] The abstract states that the approximate FIM 'accumulates transport-based errors' yet still yields rates; the precise order of these errors and the manifold regularity assumptions (e.g., bounded curvature, geodesic convexity) should be stated explicitly in the abstract or introduction.
[Experiments] Experimental section: the choice of step-size exponent α and the concrete implementation of parallel transport (exact or approximate) should be reported for each experiment so that the claimed robustness to transport errors can be assessed.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We appreciate the positive assessment of the method's potential and address the major comment on the convergence analysis below. We will revise the paper to incorporate the requested explicit bounds and verifications.

read point-by-point responses

Referee: [Convergence analysis] Convergence analysis (proof of the almost-sure rate for squared distance): the argument must show that the summed transport discrepancy, which is O(s^{1-α}) when each parallel-transport error is O(‖x_{s+1}-x_s‖), remains dominated by the driving noise and step-size terms in the supermartingale inequality for all α > 2/3. An explicit bound on the per-step transport error (or its expectation) and its insertion into the distance recursion is required; without it the rate claim for α close to 2/3 is not yet verified.

Authors: We agree that an explicit treatment of the transport discrepancy is necessary to fully verify the almost-sure rate for α close to 2/3. In the revised manuscript we will add a supporting lemma that bounds the per-step parallel-transport error by O(‖x_{s+1}−x_s‖) (with the implicit constant depending only on sectional-curvature bounds and the local Lipschitz constant of the score map, both assumed finite in a neighborhood of the minimizer). This bound will be inserted directly into the supermartingale recursion for the squared geodesic distance. The resulting accumulated transport term is O(s^{1-α}); we will show that, for every α>2/3, this term is dominated by the driving step-size and martingale-noise contributions that already produce the claimed O(log s / s^α) rate. The expanded argument, including the precise comparison of exponents, will appear in the appendix of the revised version. revision: yes

Circularity Check

0 steps flagged

Minor self-citation on background Riemannian tools; central convergence derivation remains independent

full rationale

The paper derives almost-sure rates O(log s / s^α) for α > 2/3 via supermartingale analysis on the squared geodesic distance, explicitly incorporating accumulated transport errors in the approximate FIM recursion. These steps rest on standard assumptions (geodesic convexity, bounded curvature for transport bounds) drawn from the Riemannian optimization literature rather than any self-defined quantity or fitted parameter. No equation reduces a claimed prediction to an input by construction, and any self-citations support only auxiliary geometric facts without bearing the load of the rate proof.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of an intrinsic Fisher information matrix on the manifold, the feasibility of transport maps between tangent spaces at successive iterates, and standard stochastic-approximation assumptions on step sizes and noise.

free parameters (1)

step-size exponent α
Must satisfy α > 2/3 for the stated almost-sure rate; chosen by the user to guarantee convergence.

axioms (2)

domain assumption The parameter space is a Riemannian manifold equipped with a well-defined Fisher information metric.
Invoked to justify the intrinsic formulation and the need for transport operations.
ad hoc to paper Score vectors from different iterates can be transported without introducing errors that invalidate the convergence analysis.
Required for the almost-sure rates to hold in the presence of manifold curvature.

pith-pipeline@v0.9.0 · 5524 in / 1442 out tokens · 37890 ms · 2026-05-13T18:12:12.978807+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We prove almost-sure convergence rates of O(log s / s^α) for the squared distance to the minimizer when the step size exponent α > 2/3. We also establish almost-sure rates for the approximate FIM, which now accumulates transport-based errors.
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

In the Riemannian setting, these score vectors belong to different tangent spaces and must be combined using transport operations.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.