pith. sign in

arxiv: 2509.08547 · v3 · pith:SXCF3O3Vnew · submitted 2025-09-10 · 🧮 math.OC · math.AP· math.FA· math.PR

Linear Convergence of Gradient Descent for Quadratically Regularized Optimal Transport

Pith reviewed 2026-05-21 22:47 UTC · model grok-4.3

classification 🧮 math.OC math.APmath.FAmath.PR
keywords optimal transportquadratic regularizationgradient descentlinear convergencedual problemspectral analysiscontinuous setting
0
0 comments X

The pith

Gradient descent on the dual of quadratically regularized optimal transport converges linearly in L2 after a burn-in period.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that gradient descent applied to the dual formulation of quadratically regularized optimal transport achieves linear convergence measured in the L2 norm. This means the distance from the current potentials to the optimal ones shrinks exponentially fast. The proof proceeds by linearizing the iteration around the solution and showing via spectral analysis that the resulting operator is a contraction mapping, with the nonlinear dynamics following suit after some initial steps. This result is significant because the dual problem lacks strong convexity, making standard convergence arguments unavailable, yet practical algorithms still succeed.

Core claim

We establish linear convergence in L², that is, the L² distance between the iterates and the limiting potentials decreases exponentially fast. The proof is based on a spectral analysis of the linearized gradient descent operator at the optimum. We show that this operator is a strict contraction and that the nonlinear iteration inherits this property after a burn-in period.

What carries the argument

The linearized gradient descent operator at the optimum, whose spectral radius less than one establishes it as a strict contraction in continuous function space.

If this is right

  • The dual gradient descent iteration converges exponentially fast in the L2 norm to the limiting potentials.
  • The nonlinear dynamics inherit the contraction property of the linearized operator after a finite number of steps.
  • This linear rate applies in a continuous function-space setting without relying on strong convexity of the dual objective.
  • The result supplies a convergence guarantee for computing quadratically regularized transport plans via gradient methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same spectral contraction may hold for other dual problems that are convex but not strongly convex, suggesting a general technique for proving linear rates.
  • Numerical verification of the spectral radius for specific cost functions could serve as a practical check before running the algorithm.
  • The burn-in period length might be bounded in terms of problem parameters, allowing explicit error estimates from the start.

Load-bearing premise

The spectral radius of the linearized gradient descent operator at the optimum is strictly less than one in the continuous setting.

What would settle it

A numerical test on concrete measures and costs where the ratio of successive L2 errors between iterates fails to approach a fixed value strictly below one after the burn-in phase.

Figures

Figures reproduced from arXiv: 2509.08547 by Alberto Gonz\'alez-Sanz, Andr\'es Riveros Valdevenito, Marcel Nutz.

Figure 1
Figure 1. Figure 1: Transporting P = N(0, 1) to Q = N(1, 1); the measures are truncated to [-3,3] and [-2,4], respectively. Left: Dual solutions of QOTϵ and OT. Right: Supports of the optimal couplings are sparse and converge to the optimal transport map. approaches have been considered, mostly targeting the dual problem of (5), sup f,g:Rd→R Z f(x) + g(y) − 1 2ε  f(x) + g(y) − 1 2 ∥x − y∥ 2 2 + d(P ⊗ Q)(x, y) (6) or its fir… view at source ↗
Figure 2
Figure 2. Figure 2: Gradient descent algorithm for three pairs of marginals: (top) [PITH_FULL_IMAGE:figures/full_fig_p021_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Repeating the first experiment (top row of Figure [PITH_FULL_IMAGE:figures/full_fig_p022_3.png] view at source ↗
read the original abstract

In optimal transport, quadratic regularization is an alternative to entropic regularization when sparse couplings or small regularization parameters are desired. Quadratic regularization penalizes transport couplings by the squared $L^2$ norm of their density, or equivalently by the $\chi^2$ divergence. While a number of computational approaches have been shown to work in practice, the dual problem is not strongly convex and theoretical convergence results are scarce. We focus on the dual gradient descent algorithm in a continuous setting and establish linear convergence in $L^2$, that is, the $L^2$ distance between the iterates and the limiting potentials decreases exponentially fast. The proof is based on a spectral analysis of the linearized gradient descent operator at the optimum. We show that this operator is a strict contraction and that the nonlinear iteration inherits this property after a burn-in period.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript establishes linear convergence in L² for the dual gradient-descent iterates of quadratically regularized optimal transport. The central argument proceeds by linearizing the gradient-descent map at the dual optimum, showing via spectral analysis that the resulting linear operator is a strict contraction on L², and then arguing that the nonlinear iteration inherits the linear rate after a finite burn-in period.

Significance. If the spectral-radius claim holds with a contraction constant independent of discretization and under standard assumptions on the cost and marginals, the result supplies the first rigorous linear-rate guarantee for gradient descent on the dual of quadratically regularized OT. The direct spectral approach (rather than any self-referential or fitted quantity) is a methodological strength that could extend to other non-strongly-convex regularizations.

major comments (2)
  1. [proof strategy paragraph and spectral-analysis section] The proof that the linearized gradient-descent operator has spectral radius strictly less than one (invoked to obtain the contraction) is asserted in the continuous L² setting but the manuscript does not supply explicit bounds on the spectrum or the step-size restriction that guarantee this for arbitrary continuous marginals and costs. Without such bounds or additional regularity assumptions (e.g., strict positivity or compactness), the contraction property and the subsequent inheritance by the nonlinear map remain unverified.
  2. [statement of main theorem and burn-in argument] The burn-in period after which the nonlinear iteration inherits the linear rate is stated to be finite, yet no uniform control (independent of the initial potentials or the discretization parameter) is provided. This leaves open whether the overall convergence rate remains linear with a constant independent of discretization, which is central to the practical relevance of the result.
minor comments (1)
  1. Notation for the linearized operator and its spectrum should be introduced with an explicit equation number at first use to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below with clarifications and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [proof strategy paragraph and spectral-analysis section] The proof that the linearized gradient-descent operator has spectral radius strictly less than one (invoked to obtain the contraction) is asserted in the continuous L² setting but the manuscript does not supply explicit bounds on the spectrum or the step-size restriction that guarantee this for arbitrary continuous marginals and costs. Without such bounds or additional regularity assumptions (e.g., strict positivity or compactness), the contraction property and the subsequent inheritance by the nonlinear map remain unverified.

    Authors: We appreciate the referee drawing attention to the level of detail in the spectral analysis. The manuscript establishes the strict contraction by analyzing the eigenvalues of the linearized operator, which arise from the composition of multiplication by the cost function and the orthogonal projections enforcing the marginal constraints; under the standing assumptions of continuous cost and marginals on a compact domain, this yields a spectral radius strictly less than one for all step sizes below an explicit threshold depending on the sup-norms of the cost and marginal densities. To address the request for greater explicitness, we will revise the spectral-analysis section to include a dedicated lemma that states the precise step-size restriction and the resulting bound on the spectral radius in terms of these quantities. The inheritance by the nonlinear map then follows from standard local contraction arguments in Banach space. revision: yes

  2. Referee: [statement of main theorem and burn-in argument] The burn-in period after which the nonlinear iteration inherits the linear rate is stated to be finite, yet no uniform control (independent of the initial potentials or the discretization parameter) is provided. This leaves open whether the overall convergence rate remains linear with a constant independent of discretization, which is central to the practical relevance of the result.

    Authors: The burn-in period is finite because the gradient map is continuously differentiable and the linearized operator is a contraction at the fixed point, so the iterates eventually enter a neighborhood in which the linear rate governs the behavior. We agree that uniformity with respect to discretization is desirable for practical relevance. The contraction constant itself is independent of discretization in the continuous setting; we will add a remark to the statement of the main theorem clarifying that the same constant carries over to consistent discretizations via operator-norm convergence of the discrete linearized operators to the continuous one. This preserves the linear rate while making the dependence on discretization explicit. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via direct spectral analysis

full rationale

The paper derives linear L² convergence of gradient descent for quadratically regularized OT by performing a spectral analysis of the linearized operator at the dual optimum and establishing that this operator is a strict contraction (with the nonlinear iteration inheriting the property after burn-in). This rests on explicit operator properties in the continuous function-space setting rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain. No step reduces the claimed result to its own inputs by construction; the central argument supplies independent mathematical content through the spectrum bound.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the existence of an optimum in suitable Sobolev or L2 spaces and on the well-posedness of the linearized operator; these are standard domain assumptions rather than new postulates.

axioms (2)
  • domain assumption The dual problem admits a unique optimum in the chosen function space.
    Required for the limiting potentials to exist and for linearization around that point.
  • domain assumption The linearized gradient descent operator is well-defined and bounded on L2.
    Invoked to perform the spectral analysis.

pith-pipeline@v0.9.0 · 5684 in / 1242 out tokens · 39189 ms · 2026-05-21T22:47:45.303171+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

  1. [1]

    Bayraktar, S

    E. Bayraktar, S. Eckstein, and X. Zhang. Stability and sample complexity of divergence regu- larized optimal transport.Bernoulli, 31(1):213–239, 2025

  2. [2]

    Blondel, V

    M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. volume 84 of Proceedings of Machine Learning Research, pages 880–889, 2018

  3. [3]

    H. Brezis. Functional analysis, Sobolev spaces and partial differential equations. Universitext. Springer, New York, 2011. 22

  4. [4]

    G. Carlier. On the linear convergence of the multimarginal Sinkhorn algorithm.SIAM J. Optim., 32(2):786–794, 2022

  5. [5]

    Chizat, A

    L. Chizat, A. Delalande, and T. Vaškevičius. Sharper exponential convergence rates for Sinkhorn’s algorithm in continuous settings.Math. Program., 2025

  6. [6]

    M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors,Advances in Neural Infor- mation Processing Systems, volume 26. Curran Associates, Inc., 2013

  7. [7]

    Eckstein and M

    S. Eckstein and M. Kupper. Computation of optimal transport and related hedging problems via penalization and neural networks.Appl. Math. Optim., 83(2):639–667, 2021

  8. [8]

    Eckstein and M

    S. Eckstein and M. Nutz. Convergence rates for regularized optimal transport via quantization. Math. Oper. Res., 49(2):1223–1240, 2024

  9. [9]

    Essid and J

    M. Essid and J. Solomon. Quadratically regularized optimal transport on graphs.SIAM J. Sci. Comput., 40(4):A1961–A1986, 2018

  10. [10]

    Franklin and J

    J. Franklin and J. Lorenz. On the scaling of multidimensional matrices.Linear Algebra Appl., 114/115:717–735, 1989

  11. [11]

    Garriz-Molina, A

    A. Garriz-Molina, A. González-Sanz, and G. Mordant. Infinitesimal behavior of quadrat- ically regularized optimal transport and its relation with the porous medium equation. arXiv:2407.21528, 2024

  12. [12]

    Genevay, M

    A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems 29, pages 3440–3448, 2016

  13. [13]

    González-Sanz, E

    A. González-Sanz, E. del Barrio, and M. Nutz. Sample complexity of quadratically regularized optimal transport. Forthcoming, 2025

  14. [14]

    González-Sanz, S

    A. González-Sanz, S. Eckstein, and M. Nutz. Sparse regularized optimal transport without curse of dimensionality, 2025

  15. [15]

    González-Sanz and M

    A. González-Sanz and M. Nutz. Sparsity of quadratically regularized optimal transport: Scalar case. arXiv:2410.03353, 2024

  16. [16]

    Gulrajani, F

    I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. InProceedings of the 31st International Conference on Neural Information Processing Systems, pages 5769–5779, 2017

  17. [17]

    N. Lahn, D. Mulchandani, and S. Raghvendra. A graph theoretic additive approximation of optimal transport. InProceedings of the 33rd International Conference on Neural Information Processing Systems, pages 13831–13841. Curran Associates, Inc., 2019

  18. [18]

    L. Li, A. Genevay, M. Yurochkin, and J. Solomon. Continuous regularized Wasserstein barycen- ters. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 17755–17765. Curran Associates, Inc., 2020

  19. [19]

    Lorenz and H

    D. Lorenz and H. Mahler. Orlicz space regularization of continuous optimal transport problems. Appl. Math. Optim., 85(2):Paper No. 14, 33, 2022

  20. [20]

    D. A. Lorenz and H. Mahler. Orlicz-space regularization for optimal transport and algorithms for quadratic regularization.arXiv:1909.06082, 2019

  21. [21]

    D. A. Lorenz, P. Manns, and C. Meyer. Quadratically regularized optimal transport. Appl. Math. Optim., 83(3):1919–1949, 2021

  22. [22]

    Muzellec, R

    B. Muzellec, R. Nock, G. Patrini, and F. Nielsen. Tsallis regularized optimal transport and ecological inference. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1), Feb. 2017. 23

  23. [23]

    Nesterov

    Y. Nesterov. Introductory lectures on convex optimization, volume 87 ofApplied Optimization. Kluwer Academic Publishers, Boston, MA, 2004. A basic course

  24. [24]

    M. Nutz. Quadratically regularized optimal transport: existence and multiplicity of potentials. SIAM J. Math. Anal., 57(3):2622–2649, 2025

  25. [25]

    Peyré and M

    G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5–6):355–607, 2019

  26. [26]

    Schmitzer

    B. Schmitzer. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM J. Sci. Comput., 41(3):A1443–A1481, 2019

  27. [27]

    Seguy, B

    V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large scale optimal transport and mapping estimation. InInternational Conference on Learning Represen- tations, 2018

  28. [28]

    Wiesel and X

    J. Wiesel and X. Xu. Sparsity of quadratically regularized optimal transport: Bounds on con- centration and bias.Preprint arXiv:2410.03425v1, 2024

  29. [29]

    Zhang, G

    S. Zhang, G. Mordant, T. Matsumoto, and G. Schiebinger. Manifold learning with sparse regularised optimal transport.arXiv:2307.09816, 2023. 24