Pith · machine review for the scientific record

arXiv: 2604.09263 · v1 · submitted 2026-04-10 · 🧮 math.OC · cs.LG · cs.NA · math.NA

Recognition: unknown

Natural Riemannian gradient for learning functional tensor networks


Pith reviewed 2026-05-10 16:56 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · cs.NA · math.NA
keywords functional tensor networks · Riemannian gradient descent · natural gradient · tree tensor networks · low-rank models · optimization · machine learning

The pith

Natural Riemannian gradients optimize functional tensor networks for any loss function without depending on basis choice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a natural Riemannian gradient descent method for low-rank functional tree tensor networks used in machine learning. Unlike alternating optimization, which works only for least-squares regression, this method handles arbitrary losses such as multinomial logistic regression. It relies on Amari's natural gradient so that the computed search direction stays the same regardless of the basis chosen for the underlying functional tensor product space. The framework covers both factorized and manifold-based representations of the networks, and it supplies a hierarchy of practical approximations to the full gradient for efficient computation. Experiments on standard classification datasets show faster convergence than ordinary Riemannian gradient steps.
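In symbols, the update the summary describes takes the generic natural-gradient form (the notation below is a standard rendering of Amari-style natural gradient, not lifted from the paper):

```latex
\theta_{k+1} \;=\; \theta_k \;-\; \alpha_k \, G(\theta_k)^{-1} \nabla_\theta \mathcal{L}(\theta_k),
\qquad
G(\theta_k)_{ij} \;=\; \big\langle \partial_i h_{\theta_k},\, \partial_j h_{\theta_k} \big\rangle_{\mathcal{H}},
```

where \(h_\theta\) is the function represented by the tensor network and \(\langle\cdot,\cdot\rangle_{\mathcal{H}}\) is the inner product of the functional tensor product space. Because \(G\) is built from the function-space inner product rather than from coordinates, the resulting direction transforms consistently under a change of basis; in the manifold-based variant the step is followed by a retraction to the fixed-rank manifold.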

Core claim

A natural Riemannian gradient descent approach, derived from Amari's natural gradient, produces a search direction that is independent of the basis of the functional tensor product space. This property holds for both the factorized and the manifold-based representations of low-rank functional tree tensor networks. For any loss function the method supplies a hierarchy of efficient approximations that can be used to compute parameter updates, and these updates yield measurably faster convergence on classification tasks than standard Riemannian gradient descent.

What carries the argument

The natural Riemannian gradient computed in the functional tensor product space, which yields a basis-independent search direction for parameter updates.
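A minimal numerical sketch of that basis-independence property for a toy linear model f = Φc (random data and sizes chosen for illustration, not the paper's setup): the function-space step produced by the Gram-metric (natural) direction is unchanged under a change of basis Φ ↦ ΦB⁻¹, while the plain Euclidean step is not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Basis Phi of a 5-dimensional function space sampled at 50 points,
# and an arbitrary loss gradient at the current coefficients.
Phi = rng.standard_normal((50, 5))
grad_c = rng.standard_normal(5)          # Euclidean gradient w.r.t. coefficients c

# Change of basis: Phi' = Phi @ inv(B); coefficients transform as c' = B c,
# so the coefficient gradient transforms as grad' = inv(B).T @ grad.
B = rng.standard_normal((5, 5)) + 5 * np.eye(5)   # well-conditioned change of basis
Phi2 = Phi @ np.linalg.inv(B)
grad_c2 = np.linalg.solve(B.T, grad_c)

def natural_direction(Phi, g):
    """Natural gradient direction: Gram matrix G = Phi^T Phi, d = G^{-1} g."""
    G = Phi.T @ Phi
    return np.linalg.solve(G, g)

# The *function-space* steps Phi @ d agree across bases for the natural
# gradient, but not for the plain Euclidean gradient.
step_nat_1 = Phi @ natural_direction(Phi, grad_c)
step_nat_2 = Phi2 @ natural_direction(Phi2, grad_c2)
step_euc_1 = Phi @ grad_c
step_euc_2 = Phi2 @ grad_c2

print(np.allclose(step_nat_1, step_nat_2))   # True: basis independent
print(np.allclose(step_euc_1, step_euc_2))   # False in general
```

The identity behind the check: with G' = B⁻ᵀGB⁻¹ and g' = B⁻ᵀg, the direction transforms as d' = Bd, so Φ'd' = ΦB⁻¹Bd = Φd exactly.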

Load-bearing premise

The hierarchy of efficient approximations to the true natural Riemannian gradient retains enough accuracy to produce the claimed convergence gains without introducing instability or systematic bias.
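The hierarchy the premise refers to can be illustrated on a toy Gram matrix: Figure 3's BD-ngrad and D-ngrad variants replace the full metric with block-diagonal and diagonal surrogates, respectively. The sketch below uses hypothetical block sizes and random data purely to show how the cheaper directions relate to the exact one.

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy SPD Gram matrix with two parameter blocks (e.g. two TTN cores).
# Block structure and sizes are illustrative, not taken from the paper.
n1, n2 = 4, 3
G = rng.standard_normal((n1 + n2, n1 + n2))
G = G @ G.T + np.eye(n1 + n2)        # symmetric positive definite
g = rng.standard_normal(n1 + n2)

d_exact = np.linalg.solve(G, g)      # full natural-gradient direction

# Block-diagonal approximation: keep only within-block couplings.
G_bd = np.zeros_like(G)
G_bd[:n1, :n1] = G[:n1, :n1]
G_bd[n1:, n1:] = G[n1:, n1:]
d_bd = np.linalg.solve(G_bd, g)

# Diagonal approximation: the cheapest level of the hierarchy.
d_diag = g / np.diag(G)

for d, name in [(d_bd, "block-diagonal"), (d_diag, "diagonal")]:
    cos = d @ d_exact / (np.linalg.norm(d) * np.linalg.norm(d_exact))
    print(f"{name}: cosine similarity to exact direction = {cos:.3f}")
```

Since each surrogate metric here is symmetric positive definite, every approximate direction remains a descent direction (g·d > 0); what the premise leaves open is how much the *quality* of the direction degrades for arbitrary losses.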

What would settle it

A falsifying outcome: on the same classification datasets, the natural-gradient updates produce convergence curves and final accuracies that are indistinguishable from, or worse than, those obtained by ordinary Riemannian gradient descent.

Figures

Figures reproduced from arXiv: 2604.09263 by André Uschmajew, Marius Willner, Michael Ulbrich, Nikolas Klug.

Figure 1
Figure 1: Functional tensor train (left) and functional (binary) tree tensor network (right).
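The TT format shown in Figure 1 can be made concrete with a small sketch (core sizes below are illustrative): an entry A(i₀,…,i_d) is the product of matrix slices A_k[:, i_k, :] of cores A_k ∈ ℝ^{r_k × n_k × r_{k+1}}, with boundary ranks r₀ = r_{d+1} = 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# TT cores A_k of shape (r_k, n_k, r_{k+1}), boundary ranks r_0 = r_{d+1} = 1.
ranks = [1, 3, 4, 2, 1]         # r_0, ..., r_4
dims = [2, 3, 2, 3]             # n_0, ..., n_3
cores = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1]))
         for k in range(len(dims))]

def tt_entry(cores, idx):
    """Evaluate one entry A[i_0, ..., i_d] as a product of core slices."""
    v = cores[0][:, idx[0], :]                  # 1 x r_1 row vector
    for k in range(1, len(cores)):
        v = v @ cores[k][:, idx[k], :]          # contract over the shared rank
    return float(v[0, 0])                       # final shape is 1 x 1

def tt_full(cores):
    """Densify the TT tensor (only feasible for tiny examples like this)."""
    T = cores[0]                                # shape (1, n_0, r_1)
    for core in cores[1:]:
        T = np.tensordot(T, core, axes=([-1], [0]))
    return T.reshape([c.shape[1] for c in cores])

A = tt_full(cores)
assert np.isclose(A[1, 2, 0, 1], tt_entry(cores, (1, 2, 0, 1)))
```

The per-entry cost is O(d·r²) regardless of the full tensor's size, which is what makes the low-rank format usable as a learning model.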
Figure 2
Figure 2: Comparison of grad and ngrad methods for a recovery problem under change of basis. The natural Riemannian gradient methods achieve the expected minimum loss (black line).
Figure 3
Figure 3: Comparison of standard Riemannian gradient descent (grad), natural Riemannian gradient descent (ngrad), BD-ngrad, BDO-ngrad, and D-ngrad for the digits dataset. Reported accuracies: grad 98.71%, ngrad 98.71%, BD-ngrad 98.71%, BDO-ngrad 98.20%, D-ngrad 97.17%.
Figure 4
Figure 4: Comparison of stochastic Riemannian gradient descent (grad) and D-ngrad for the MNIST dataset.
Original abstract

We consider machine learning tasks with low-rank functional tree tensor networks (TTN) as the learning model. While in the case of least-squares regression, low-rank functional TTNs can be efficiently optimized using alternating optimization, this is not directly possible in other problems, such as multinomial logistic regression. We propose a natural Riemannian gradient descent type approach applicable to arbitrary losses which is based on the natural gradient by Amari. In particular, the search direction obtained by the natural gradient is independent of the choice of basis of the underlying functional tensor product space. Our framework applies to both the factorized and manifold-based approach for representing the functional TTN. For practical application, we propose a hierarchy of efficient approximations to the true natural Riemannian gradient for computing the updates in the parameter space. Numerical experiments confirm our theoretical findings on common classification datasets and show that using natural Riemannian gradient descent for learning considerably improves convergence behavior when compared to standard Riemannian gradient methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes a natural Riemannian gradient descent method for optimizing low-rank functional tree tensor networks (TTNs) as models in machine learning tasks. It extends Amari's natural gradient to arbitrary loss functions (beyond least-squares regression), establishes that the resulting search direction is independent of the choice of basis for the underlying functional tensor product space, covers both factorized and manifold-based TTN representations, introduces a hierarchy of efficient approximations to the true natural gradient for practical computation, and reports numerical experiments on classification datasets demonstrating improved convergence compared to standard Riemannian gradient descent.

Significance. If the central claims hold, the work offers a principled geometric optimization framework that generalizes Riemannian methods to functional TTNs with general losses while preserving intrinsic basis independence. This could enable more effective training of tensor network models in classification and other non-regression settings. The explicit use of an intrinsic Riemannian metric (rather than coordinate-dependent gradients) and the hierarchy of approximations are notable strengths; the empirical results provide concrete evidence of practical benefit. The approach builds directly on established Riemannian optimization and natural gradient literature without introducing circularity or undefined quantities.

major comments (2)
  1. [§4] §4 (Hierarchy of approximations): The central practical claim relies on a hierarchy of approximations to the true natural Riemannian gradient, yet no error bounds, stability analysis, or convergence-rate perturbation results are provided for these approximations. This is load-bearing because the abstract and experiments assert improved convergence; without such analysis it is unclear whether the approximations preserve the claimed advantages or introduce bias/instability for arbitrary losses.
  2. [§5] §5 (Numerical experiments): The experiments confirm faster convergence, but the text does not report statistical significance tests, variance across random seeds, or ablation studies isolating the effect of the natural gradient versus the choice of approximation level. This weakens the strength of the empirical support for the method's superiority.
minor comments (3)
  1. [§3] The abstract states that the method applies to both factorized and manifold-based approaches, but the main text should clarify in one place (e.g., §3) whether the basis-independence proof carries over identically to both representations or requires separate arguments.
  2. Notation for the functional tensor product space and the Riemannian metric could be introduced earlier and used consistently; occasional shifts between coordinate and intrinsic descriptions make the independence argument harder to follow on first reading.
  3. A short comparison table or paragraph contrasting the proposed natural gradient with the Euclidean gradient and with existing alternating least-squares methods for the regression case would help readers situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment, and recommendation for minor revision. We address the major comments point by point below.

Point-by-point responses
  1. Referee: [§4] §4 (Hierarchy of approximations): The central practical claim relies on a hierarchy of approximations to the true natural Riemannian gradient, yet no error bounds, stability analysis, or convergence-rate perturbation results are provided for these approximations. This is load-bearing because the abstract and experiments assert improved convergence; without such analysis it is unclear whether the approximations preserve the claimed advantages or introduce bias/instability for arbitrary losses.

    Authors: We thank the referee for this observation. The hierarchy is introduced to ensure computational tractability for high-dimensional functional tensor spaces while preserving the key property of basis independence. General error bounds for arbitrary losses are difficult to derive without additional assumptions on the loss or the functional space, which would narrow the scope of the contribution. In the revised manuscript we will expand §4 with a qualitative discussion of the approximation levels, their design rationale, computational trade-offs, and the fact that the exact natural gradient (and its basis independence) is recovered in the limit of the hierarchy. We will also note the absence of general perturbation bounds as a limitation and direction for future work. revision: partial

  2. Referee: [§5] §5 (Numerical experiments): The experiments confirm faster convergence, but the text does not report statistical significance tests, variance across random seeds, or ablation studies isolating the effect of the natural gradient versus the choice of approximation level. This weakens the strength of the empirical support for the method's superiority.

    Authors: We agree that the experimental section would benefit from additional statistical rigor. In the revised manuscript we will report results averaged over multiple random seeds together with standard deviations, include statistical significance tests (e.g., paired t-tests) on the convergence metrics, and add ablation studies that vary the approximation level while keeping the natural-gradient formulation fixed. These changes will strengthen the empirical evidence without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper extends Amari's established natural gradient to the manifold of low-rank functional TTNs via the intrinsic Riemannian metric, yielding basis-independence by construction of the geometry rather than by redefinition or fitting. The hierarchy of approximations is introduced as a practical implementation choice separate from the core claim, and numerical experiments serve as external validation rather than tautological confirmation. No load-bearing step reduces to a self-citation chain, a fitted parameter renamed as prediction, or an ansatz smuggled via prior work by the same authors. The central independence property follows directly from using the manifold's own metric instead of coordinate-dependent gradients.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method relies on standard assumptions from Riemannian geometry and information geometry without introducing new free parameters or postulated entities in the abstract description.

axioms (2)
  • domain assumption The space of low-rank functional tree tensor networks forms a Riemannian manifold suitable for gradient-based optimization.
    Invoked to apply Riemannian gradient descent techniques.
  • standard math Amari's natural gradient yields a search direction independent of basis choice in the functional tensor product space.
    Core property relied upon for the method's invariance claim.

pith-pipeline@v0.9.0 · 5473 in / 1376 out tokens · 45451 ms · 2026-05-10T16:56:21.113483+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 4 canonical work pages

  1. [1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008.
  2. [2] E. Alpaydin and C. Kaynak. Optical Recognition of Handwritten Digits. UCI Machine Learning Repository, 1998.
  3. [3] S.-I. Amari. Information geometry. In Geometry and Nature (Madeira, 1995), volume 203 of Contemp. Math., pages 81–95. Amer. Math. Soc., Providence, RI, 1997.
  4. [4] S.-I. Amari. Natural gradient works efficiently in learning. Neural Comput., 10(2):251–276, 1998.
  5. [5] S.-I. Amari and H. Nagaoka. Methods of Information Geometry, volume 191 of Translations of Mathematical Monographs. American Mathematical Society, Providence, RI; Oxford University Press, Oxford, 2000.
  6. [6] M. Bachmayr. Low-rank tensor methods for partial differential equations. Acta Numer., 32:1–121, 2023.
  7. [7] M. Bachmayr, R. Schneider, and A. Uschmajew. Tensor networks and hierarchical tensors for the solution of high-dimensional partial differential equations. Found. Comput. Math., 16(6):1423–1472, 2016.
  8. [8] B. W. Bader and T. G. Kolda. Efficient MATLAB computations with sparse and factored tensors. SIAM J. Sci. Comput., 30(1):205–231, 2007/08.
  9. [9] N. Boumal. An Introduction to Optimization on Smooth Manifolds. Cambridge University Press, Cambridge, 2023.
  10. [10] Z. Chen, K. Batselier, J. A. K. Suykens, and N. Wong. Parallelized tensor train learning of polynomial classifiers. IEEE Trans. Neural Netw. Learn. Syst., 29(10):4621–4632, 2018.
  11. [11] C. Da Silva and F. J. Herrmann. Optimization on the hierarchical Tucker manifold - applications to tensor completion. Linear Algebra Appl., 481:131–173, 2015.
  12. [12] M. Eigel, C. Miranda, A. Nouy, and D. Sommer. Approximation and learning with compositional tensor trains. arXiv:2512.18059, 2025.
  13. [13] A. A. Gorodetsky and J. D. Jakeman. Gradient-based optimization for regression in the functional tensor-train format. J. Comput. Phys., 374:1219–1238, 2018.
  14. [14] L. Grasedyck, D. Kressner, and C. Tobler. A literature survey of low-rank tensor approximation techniques. GAMM-Mitt., 36(1):53–78, 2013.
  15. [15] W. Hackbusch and S. Kühn. A new scheme for the tensor representation. J. Fourier Anal. Appl., 15(5):706–722, 2009.
  16. [16] W. Hackbusch. Tensor Spaces and Numerical Tensor Calculus. Springer, Cham, second edition, 2019.
  17. [17] S. Holtz, T. Rohwedder, and R. Schneider. On manifolds of tensors of fixed TT-rank. Numer. Math., 120(4):701–731, 2012.
  18. [18] J. Hu, R. Ao, A. Man-Cho So, M. Yang, and Z. Wen. Riemannian natural gradient methods. SIAM J. Sci. Comput., 46(1):A204–A231, 2024.
  19. [19] D. Hwang. FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information. arXiv:2405.12807, 2024.
  20. [20] B. N. Khoromskij. Tensor Numerical Methods in Scientific Computing. De Gruyter, Berlin, 2018.
  21. [21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  22. [22] S. Klus and P. Gelß. Tensor-based algorithms for image classification. Algorithms, 12(11):240, 2019.
  23. [23] D. Kressner, M. Steinlechner, and B. Vandereycken. Low-rank tensor completion by Riemannian optimization. BIT, 54(2):447–468, 2014.
  24. [24] D. Kressner, M. Steinlechner, and B. Vandereycken. Preconditioned low-rank Riemannian optimization for linear systems with tensor product structure. SIAM J. Sci. Comput., 38(4):A2018–A2044, 2016.
  25. [25] Y. LeCun, C. Cortes, and C. J. C. Burges. The MNIST database of handwritten digits, 1998.
  26. [26] J. M. Lee. Introduction to Riemannian Manifolds. Springer, Cham, second edition, 2018.
  27. [27] J. Martens. New insights and perspectives on the natural gradient method. Journal of Machine Learning Research, 21(146):1–76, 2020.
  28. [28] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2408–2417. PMLR, 2015.
  29. [29] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. arXiv:1503.05671, 2020.
  30. [30] A. Novikov, M. Trofimov, and I. Oseledets. Exponential machines. Bull. Pol. Acad. Sci. Tech. Sci., 66(6):789–797, 2018.
  31. [31] I. V. Oseledets. Tensor-train decomposition. SIAM J. Sci. Comput., 33(5):2295–2317, 2011.
  32. [32] C. Radhakrishna Rao. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37:81–91, 1945.
  33. [33] M. Rakhuba, A. Novikov, and I. Oseledets. Low-rank Riemannian eigensolver for high-dimensional Hamiltonians. J. Comput. Phys., 396:718–737, 2019.
  34. [34] R. Schneider and M. Oster. Some thoughts on compositional tensor networks. In Multiscale, Nonlinear and Adaptive Approximation II, pages 419–447. Springer, Cham.
  35. [35] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning. Cambridge University Press, 2014.
  36. [36] M. Steinlechner. Riemannian optimization for high-dimensional tensor completion. SIAM J. Sci. Comput., 38(5):S461–S484, 2016.
  37. [37] E. M. Stoudenmire. Learning relevant features of data with multi-scale tensor networks. Quantum Sci. Technol., 3(3):034003, 2018.
  38. [38] E. Stoudenmire and D. J. Schwab. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems, volume 29, pages 4799–4807. Curran Associates, Inc., 2016.
  39. [39] A. Uschmajew and B. Vandereycken. The geometry of algorithms using hierarchical tensors. Linear Algebra Appl., 439(1):133–166, 2013.
  40. [40] A. Uschmajew and B. Vandereycken. Geometric methods on low-rank matrix and tensor manifolds. In Handbook of Variational Methods for Nonlinear Geometric Data, pages 261–313. Springer, Cham, 2020.
  41. [41] M. Willner, M. Trenti, and D. Lebiedz. Riemannian optimization on tree tensor networks with application in machine learning. arXiv:2507.21726, 2025.
  42. [42] N. Yamauchi, H. Hontani, and T. Yokota. Expectation-maximization alternating least squares for tensor network logistic regression. Frontiers Appl. Math. Stat., 11, 2025.