Recognition: 2 theorem links
· Lean theorem · A Provably Convergent and Practical Algorithm for Gromov–Wasserstein Optimal Transport
Pith reviewed 2026-05-08 17:42 UTC · model grok-4.3
The pith
An inexact projected-gradient method with a computable feasibility-residual condition converges to stationary points for squared-loss Gromov-Wasserstein optimal transport.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For squared-loss GWOT, the inexact projected-gradient framework with a verifiable feasibility-residual-based inexact condition for the projection subproblem proves subsequential convergence to stationary points. With a mild tolerance-decay condition on the inexactness tolerances, the whole sequence converges. The resulting method retains the simplicity and sparsity of projected-gradient schemes while providing rigorous convergence guarantees without assuming exact projections.
What carries the argument
The inexact projected-gradient framework equipped with a verifiable feasibility-residual-based inexact condition for the projection subproblem onto the transport polytope.
If this is right
- The algorithm retains the simplicity and sparsity of projected-gradient schemes for GWOT.
- It provides rigorous convergence guarantees that turn projected-gradient methods into a principled and scalable approach.
- The condition is directly computable from residuals and avoids reliance on unknown quantities such as the exact projection point.
- Large-scale GWOT computations become reliable with provable stationarity guarantees.
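The shape of such a scheme can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the projection subroutine (cyclic Euclidean projections onto the two marginal constraints and the nonnegative orthant), the residual definition, and the tolerance schedule 1/(k+1)² are all assumptions made for the sketch. What it shows is an inexact projected-gradient loop whose projection step stops on a directly computable feasibility test.

```python
import numpy as np

def approx_project(M, a, b, tol, max_sweeps=5000):
    """Approximately project M onto the transport polytope U(a, b) by cyclic
    Euclidean projections onto {P : P 1 = a}, {P : P^T 1 = b}, and {P >= 0},
    stopping once a computable feasibility residual drops below tol.
    Illustrative stand-in for the paper's inexact projection subroutine."""
    n, m = M.shape
    P = M.copy()
    for _ in range(max_sweeps):
        P = P + np.outer(a - P.sum(axis=1), np.ones(m)) / m  # fix row sums
        P = P + np.outer(np.ones(n), b - P.sum(axis=0)) / n  # fix column sums
        P = np.maximum(P, 0.0)                               # nonnegativity
        res = (np.linalg.norm(P.sum(axis=1) - a)
               + np.linalg.norm(P.sum(axis=0) - b))          # marginal violation
        if res <= tol:
            break
    return P

def inexact_pg_gw(Cx, Cy, a, b, steps=50):
    """Inexact projected gradient on f(P) = -<Cx P Cy, P> with symmetric
    cost matrices, step size 1/L_f where L_f = 2 ||Cx||_2 ||Cy||_2, and a
    decaying projection tolerance 1/(k+1)^2 (an illustrative summable
    schedule, not the paper's stated condition)."""
    L = 2.0 * np.linalg.norm(Cx, 2) * np.linalg.norm(Cy, 2)
    P = np.outer(a, b)                 # independent coupling: feasible start
    for k in range(steps):
        grad = -2.0 * Cx @ P @ Cy      # gradient of f for symmetric Cx, Cy
        P = approx_project(P - grad / L, a, b, tol=1.0 / (k + 1) ** 2)
    return P
```

The independent coupling a bᵀ is a convenient feasible starting point; any scheme that drives the residual below the current tolerance could replace `approx_project`.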
Where Pith is reading between the lines
- The same residual-based inexactness test could be applied to other nonconvex optimal transport problems where exact projections remain prohibitive.
- In domain-adaptation pipelines, the method may reduce the need for heuristic early stopping by supplying an explicit convergence monitor.
- Relaxing the tolerance-decay requirement further would widen the set of practical implementations that still carry formal guarantees.
Load-bearing premise
The mild tolerance-decay condition on the inexactness tolerances for the projection subproblems, which upgrades subsequential convergence to convergence of the entire sequence.
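The review does not spell out the decay condition itself. A standard way such conditions are instantiated (an assumption here, not the paper's stated condition) is a summable tolerance schedule:

```latex
% Illustrative tolerance schedule; the specific form is an assumption,
% not taken from the paper. Subsequential convergence needs only
% \varepsilon_k \to 0, while summability is the usual "mild decay":
\varepsilon_k \;=\; \frac{\varepsilon_0}{(k+1)^{1+\delta}}, \qquad \delta > 0,
\qquad \text{so that} \qquad \sum_{k=0}^{\infty} \varepsilon_k < \infty .
```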
What would settle it
A numerical example in which the feasibility-residual inexact condition holds for every projection step yet the iterates do not approach any stationary point of the squared-loss GWOT objective.
Original abstract
Gromov--Wasserstein optimal transport (GWOT) aligns metric measure spaces by matching their within-domain relational structures, but large-scale GWOT remains challenging because its objective is nonconvex and projection onto the transport polytope is often solved only approximately in practice. This leads to a gap between practical projected-gradient implementations and convergence theory, which typically assumes exact projections. For squared-loss GWOT, we propose an inexact projected-gradient framework with a verifiable feasibility-residual-based inexact condition for the projection subproblem. This condition is directly computable and avoids unknown quantities such as the exact projection point. Under this implementable condition, we prove subsequential convergence to stationary points and, with a mild tolerance-decay condition, convergence of the whole sequence. The resulting method retains the simplicity and sparsity of projected-gradient schemes while providing rigorous convergence guarantees, turning projected-gradient methods into a principled and scalable approach for GWOT with provable reliability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an inexact projected-gradient framework for squared-loss Gromov-Wasserstein optimal transport. It introduces a feasibility-residual-based inexact condition for the projection subproblem onto the transport polytope that is directly computable without requiring the exact projection point. Under this condition the iterates are shown to have subsequential convergence to stationary points; an additional mild tolerance-decay condition on the inexactness tolerances upgrades the guarantee to convergence of the full sequence. The method is presented as retaining the simplicity and sparsity of standard projected-gradient schemes while supplying rigorous convergence guarantees.
Significance. If the stated convergence results hold, the work is significant for closing the theory-practice gap in large-scale GWOT. It supplies verifiable, implementable inexactness criteria that allow projected-gradient methods to be used with provable reliability on a nonconvex problem, which is valuable for applications in domain adaptation, shape matching, and other metric-measure-space tasks where approximate projections are inevitable.
minor comments (3)
- The abstract states that the feasibility residual is 'directly computable and avoids unknown quantities such as the exact projection point'; the main text should include an explicit side-by-side comparison of the proposed residual with a standard inexactness criterion that would require the exact projection, to make the advantage concrete.
- The mild tolerance-decay condition is load-bearing for upgrading subsequential to full-sequence convergence; a short discussion or numerical illustration of how the decay rate can be chosen in practice for typical GWOT instances would strengthen the claim that the condition is mild and implementable.
- Notation for the transport polytope and the squared-loss objective should be introduced with one additional sentence of context in the problem-statement section for readers who are not already expert in optimal transport.
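To make the first comment concrete, here is a minimal sketch contrasting the two kinds of criterion. The residual's exact form is an illustrative assumption, not necessarily the paper's definition:

```python
import numpy as np

def feasibility_residual(P, a, b):
    """A directly computable feasibility residual for the transport polytope
    U(a, b) = {P >= 0 : P 1 = a, P^T 1 = b}: marginal violations plus the
    size of any negative part. Illustrative form only."""
    return (np.linalg.norm(P.sum(axis=1) - a)       # row-marginal violation
            + np.linalg.norm(P.sum(axis=0) - b)     # column-marginal violation
            + np.linalg.norm(np.minimum(P, 0.0)))   # negativity violation

# By contrast, a classical inexactness test of the form
#   ||P - Proj_{U(a,b)}(P)||_F <= eps
# requires the exact projection Proj_{U(a,b)}(P), i.e. solving exactly the
# very subproblem the inexact method is designed to avoid.
```

For instance, the independent coupling a bᵀ is exactly feasible, so its residual is zero, while any marginal or sign violation makes the residual strictly positive.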
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript, the recognition of its significance in bridging theory and practice for large-scale Gromov-Wasserstein optimal transport, and the recommendation for minor revision. The report accurately captures the core contributions of the inexact projected-gradient framework, the feasibility-residual inexactness condition, and the convergence guarantees.
Circularity Check
No significant circularity; the derivation applies standard inexact projected-gradient theory.
full rationale
The central contribution is an implementable feasibility-residual inexactness condition for the projection subproblem in projected-gradient methods for squared-loss GWOT, together with a subsequential convergence proof to stationary points (and full-sequence convergence under a mild tolerance-decay condition). These arguments rest on standard optimization theory for inexact projected gradients applied to the nonconvex problem; the condition is defined directly in terms of computable residuals and does not reduce the claimed convergence result to a tautology, a fitted parameter, or a self-citation chain. No load-bearing self-definitional steps, renamed empirical patterns, or uniqueness theorems imported from the authors' prior work appear in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption: Squared-loss formulation of the GWOT objective
- domain assumption: Projection subproblems admit iterates satisfying the feasibility-residual inexact condition
- ad hoc to paper: Mild tolerance-decay condition on the inexactness tolerances
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost = ½(x + x⁻¹) − 1), theorem washburn_uniqueness_aczel, tagged unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
min_{Π∈U(a,b)} f(Π) := −⟨C_X Π C_Y, Π⟩, where f is L_f-smooth with L_f := 2‖C_X‖₂‖C_Y‖₂
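Assuming symmetric cost matrices C_X and C_Y (as holds for distance matrices), the stated smoothness constant follows from a one-line bound on the gradient; this unpacking is ours, consistent with the displayed L_f:

```latex
\nabla f(\Pi) = -2\, C_X \Pi\, C_Y, \qquad
\|\nabla f(\Pi) - \nabla f(\Pi')\|_F
  = 2\,\| C_X (\Pi - \Pi')\, C_Y \|_F
  \;\le\; 2\,\|C_X\|_2\, \|C_Y\|_2\, \|\Pi - \Pi'\|_F ,
```

so f is L_f-smooth with L_f = 2‖C_X‖₂‖C_Y‖₂.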
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] J. Altschuler, J. Niles-Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, 2017.
- [2] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka–Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
- [3] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: Proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013.
- [4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
- [5] J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
- [7] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459–494, 2014.
- [8] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
- [9] J. Chen, B.T. Nguyen, S.H. Koh, and Y.S. Soh. Semidefinite relaxations of the Gromov–Wasserstein distance. In Advances in Neural Information Processing Systems, 2024.
- [10] H.T.M. Chu, L. Liang, K.-C. Toh, and L. Yang. An efficient implementable inexact entropic proximal point algorithm for a class of linear programming problems. Computational Optimization and Applications, 85(1):107–146, 2023.
- [11] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, 2013.
- [12] P. Demetci, R. Santorella, B. Sandstede, W. S. Noble, and R. Singh. Gromov–Wasserstein optimal transport to align single-cell multi-omics data. bioRxiv, 2020.
- [13] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110, 1956.
- [14] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, 2016.
- [15] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods. Springer, 1993.
- [16] A. J. Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 49(4):263–265, 1952.
- [17] D. Hou, L. Liang, and K.-C. Toh. A sparse smoothing Newton method for solving discrete optimal transport problems. ACM Transactions on Mathematical Software, 50(3):1–26, 2024.
- [18] L.V. Kantorovich. On the translocation of masses. Journal of Mathematical Sciences, 133(4), 2006.
- [19] J. Li, J. Tang, L. Kong, H. Liu, J. Li, A.M.-C. So, and J. Blanchet. A convergent single-loop algorithm for relaxation of Gromov–Wasserstein in graph data. In International Conference on Learning Representations, 2023.
- [20] X. Li, D.F. Sun, and K.-C. Toh. An asymptotically superlinearly convergent semismooth Newton augmented Lagrangian method for linear programming. SIAM Journal on Optimization, 30(3):2410–2440, 2020.
- [21]
- [22] L. Liang, D.F. Sun, and K.-C. Toh. An inexact augmented Lagrangian method for second-order cone programming with applications. SIAM Journal on Optimization, 31(3):1748–1773, 2021.
- [23] Y. Liu, Z. Wen, and W. Yin. A multiscale semi-smooth Newton method for optimal transport. Journal of Scientific Computing, 91(2):39, 2022.
- [24] H. Lu and J. Yang. PDOT: A practical primal-dual algorithm and a GPU-based solver for optimal transport. arXiv preprint arXiv:2407.19689, 2024.
- [25] X. Ma, X. Chu, Y. Wang, Y. Lin, J. Zhao, L. Ma, and W. Zhu. Fused Gromov–Wasserstein graph mixup for graph-level classifications. In Advances in Neural Information Processing Systems, 2023.
- [26] F. Mémoli. Gromov–Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4):417–487, 2011.
- [27] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
- [28] G. Peyré and M. Cuturi. Computational Optimal Transport: With Applications to Data Science. Now Foundations and Trends, 2019.
- [29] G. Peyré, M. Cuturi, and J. Solomon. Gromov–Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, 2016.
- [30] B.T. Polyak. Introduction to Optimization. Optimization Software Inc., New York, 1987.
- [31] R.T. Rockafellar and R.J.-B. Wets. Variational Analysis. Springer, 1998.
- [32] M. Scetbon, M. Cuturi, and G. Peyré. Low-rank Sinkhorn factorization. In International Conference on Machine Learning, 2021.
- [33] M. Scetbon, G. Peyré, and M. Cuturi. Linear-time Gromov–Wasserstein distances using low rank couplings and costs. In International Conference on Machine Learning, 2022.
- [34] T. Séjourné, F.-X. Vialard, and G. Peyré. The unbalanced Gromov–Wasserstein distance: Conic formulation and relaxation. In Advances in Neural Information Processing Systems, 2021.
- [35] T. Vayer, L. Chapel, R. Flamary, R. Tavenard, and N. Courty. Optimal transport for structured data with application on graphs. In International Conference on Machine Learning, 2019.
- [36] C. Villani. Optimal Transport: Old and New, volume 338. Springer, 2009.
- [37] C. Vincent-Cuaz, R. Flamary, M. Corneli, T. Vayer, and N. Courty. Semi-relaxed Gromov–Wasserstein divergence and applications on graphs. In International Conference on Learning Representations, 2022.
- [38] D. Wu, L. Liang, and H. Yang. PINS: Proximal iterations with sparse Newton and Sinkhorn for optimal transport. arXiv preprint arXiv:2502.03749, 2025.
- [39] Y. Xie, X. Wang, R. Wang, and H. Zha. A fast proximal point method for computing exact Wasserstein distance. In Uncertainty in Artificial Intelligence, 2020.
- [40] H. Xu, D. Luo, H. Zha, and L. Carin. Gromov–Wasserstein learning for graph matching and node embedding. In International Conference on Machine Learning, 2019.
- [41] Y. Yan, W. Li, H. Wu, H. Min, M. Tan, and Q. Wu. Semi-supervised optimal transport for heterogeneous domain adaptation. In International Joint Conference on Artificial Intelligence, 2018.
- [42] H. Yang, L. Liang, L. Carlone, and K.-C. Toh. An inexact projected gradient method with rounding and lifting by nonlinear programming for solving rank-one semidefinite relaxation of polynomial optimization. Mathematical Programming, 201(1):409–472, 2023.
- [43] L. Yang, J. Li, D.F. Sun, and K.-C. Toh. A fast globally linearly convergent algorithm for the computation of Wasserstein barycenters. Journal of Machine Learning Research, 22(21):1–37, 2021.
- [44] L. Yang, L. Liang, H.T.M. Chu, and K.-C. Toh. A corrected inexact proximal augmented Lagrangian method with a relative error criterion for a class of group-quadratic regularized optimal transport problems. Journal of Scientific Computing, 99(3):79, 2024.
- [45] L. Yang and K.-C. Toh. Bregman proximal point algorithm revisited: A new inexact version and its inertial variant. SIAM Journal on Optimization, 32(3):1523–1554, 2022.
- [46] G. Zhang, Z. Gu, Y. Yuan, and D.F. Sun. HOT: An efficient Halpern accelerating algorithm for optimal transport problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(8):6703–6714, 2025.
- [47] X.-Y. Zhao, D.F. Sun, and K.-C. Toh. A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM Journal on Optimization, 20(4):1737–1765, 2010.
- [48] J. Zhu, L. Liang, L. Yang, and K.-C. Toh. ripALM: A relative-type inexact proximal augmented Lagrangian method for linearly constrained convex optimization. arXiv preprint arXiv:2411.13267, 2024.
discussion (0)