pith. machine review for the scientific record.

arXiv: 2605.08485 · v1 · submitted 2026-05-08 · 📊 stat.ML · cs.LG · math.ST · stat.ME · stat.TH

Recognition: 2 theorem links · Lean Theorem

Sinkhorn Treatment Effects: A Causal Optimal Transport Measure

Alex Luedtke, Medha Agarwal

Pith reviewed 2026-05-12 01:25 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · math.ST · stat.ME · stat.TH
keywords Sinkhorn treatment effect · entropic optimal transport · counterfactual distributions · distributional treatment effects · debiased estimation · pathwise differentiability · causal inference · regularized optimal transport

The pith

The Sinkhorn treatment effect measures divergence between entire counterfactual distributions via entropic optimal transport and admits debiased estimators for valid tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the Sinkhorn treatment effect as an entropic optimal transport divergence between the counterfactual outcome distributions under treatment and control. Unlike average treatment effects, this quantity registers any difference in the full shapes of those distributions. The authors express the divergence as a smooth transformation of counterfactual mean embeddings, which establishes pathwise differentiability and permits construction of debiased estimators. These estimators yield asymptotically normal limits, supporting valid tests of the null that the counterfactual distributions coincide, at a fixed regularization level. An aggregated test combines results over a grid of regularization values to guard against power loss from a poor choice of that level. If the development is correct, analysts obtain a nonparametric, distribution-wide tool for detecting treatment effects that standard mean-based methods would miss.
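A minimal numerical sketch of the quantity at the center of the paper: a log-domain Sinkhorn estimate of the entropic transport cost between two samples, debiased into a divergence that vanishes when the samples coincide. The squared-Euclidean cost, the sample sizes, and the `eps` value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_cost(x, y, eps, n_iter=300):
    """Entropic OT transport cost <pi, C> between two empirical samples,
    computed with log-domain Sinkhorn iterations for numerical stability."""
    n, m = len(x), len(y)
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)  # squared Euclidean cost
    log_a = np.full(n, -np.log(n))  # uniform source weights
    log_b = np.full(m, -np.log(m))  # uniform target weights
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        # dual potential updates in the log domain
        f = -eps * logsumexp((g[None, :] - C) / eps + log_b[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps + log_a[:, None], axis=0)
    # transport plan pi_ij = a_i b_j exp((f_i + g_j - C_ij) / eps)
    log_pi = log_a[:, None] + log_b[None, :] + (f[:, None] + g[None, :] - C) / eps
    return float(np.sum(np.exp(log_pi) * C))

def sinkhorn_divergence(x, y, eps):
    """Debiased divergence: exactly zero when the two samples coincide."""
    return (sinkhorn_cost(x, y, eps)
            - 0.5 * sinkhorn_cost(x, x, eps)
            - 0.5 * sinkhorn_cost(y, y, eps))
```

Applied to estimated counterfactual outcome samples for the two arms, a strictly positive value signals a distributional treatment effect even when the arms' means agree.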

Core claim

The Sinkhorn treatment effect is introduced as the entropic optimal transport divergence between counterfactual distributions. This functional is shown to equal a smooth map applied to the counterfactual mean embeddings under an appropriate kernel. The smoothness yields first-order pathwise differentiability in general and second-order pathwise differentiability under the null of equal counterfactual distributions. These properties allow construction of debiased estimators that are asymptotically normal, thereby delivering asymptotically valid tests for distributional treatment effects at any fixed entropic regularization parameter. An aggregated test is further proposed that pools evidence across a grid of regularization values, since test power depends on the unknown optimal level.
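For concreteness, one standard formulation of the debiased entropic (Sinkhorn) divergence consistent with this description (the notation here is assumed; the paper's exact conventions may differ):

```latex
S_\varepsilon(P_0, P_1)
  = \mathrm{OT}_\varepsilon(P_0, P_1)
  - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(P_0, P_0)
  - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(P_1, P_1),
\qquad
\mathrm{OT}_\varepsilon(\mu, \nu)
  = \min_{\pi \in \Pi(\mu,\nu)}
      \int c \, d\pi + \varepsilon\, \mathrm{KL}\!\left(\pi \,\middle\|\, \mu \otimes \nu\right).
```

The debiasing terms make $S_\varepsilon(P, P) = 0$, so the null of equal counterfactual distributions corresponds to $S_\varepsilon(P_0, P_1) = 0$.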

What carries the argument

The Sinkhorn treatment effect, defined as the entropic optimal transport divergence between counterfactual outcome distributions and represented as a smooth functional of their mean embeddings.

If this is right

  • Debiased estimators for the Sinkhorn treatment effect converge to a normal limit at the expected rate.
  • Hypothesis tests for equality of counterfactual distributions control type-I error asymptotically at a fixed regularization level.
  • An aggregated test over a grid of regularization values combines evidence without requiring knowledge of the optimal level in advance.
  • The procedure detects distributional shifts on both simulated data and real image datasets.
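The aggregation claim in the third bullet can be illustrated with a simple Bonferroni rule over the regularization grid. The paper's actual combining rule is not specified on this page, so the rule, grid, and p-values below are placeholders.

```python
import numpy as np

def aggregated_test(p_values_by_eps, alpha=0.05):
    """Reject the null of equal counterfactual distributions if any
    per-regularization p-value survives a Bonferroni correction
    for the size of the grid."""
    p = np.asarray(list(p_values_by_eps.values()))
    return bool(p.min() <= alpha / len(p))

# hypothetical per-epsilon p-values from tests run on a grid
p_by_eps = {0.1: 0.20, 0.5: 0.004, 1.0: 0.09, 2.0: 0.31}
reject = aggregated_test(p_by_eps)  # 0.004 <= 0.05 / 4, so the test rejects
```

The point of aggregation is exactly the bullet's claim: no single ε must be chosen in advance, at the price of a multiplicity correction.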

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same kernel-embedding route may extend inference to other causal functionals that involve optimal transport distances.
  • The method supplies a concrete way to test for treatment effects when only shape or tail differences are expected rather than mean shifts.
  • Data-driven aggregation rules could replace the fixed grid while preserving asymptotic control.

Load-bearing premise

Counterfactual mean embeddings must exist in a reproducing kernel Hilbert space so that the entropic divergence becomes a differentiable functional of those embeddings.
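The premise can be made concrete with empirical mean embeddings. Under a bounded characteristic kernel (a Gaussian kernel is used here as an assumption), each distribution maps to an RKHS element, and the squared RKHS distance between the two embeddings is the familiar (biased) MMD² statistic.

```python
import numpy as np

def gaussian_gram(x, y, bandwidth):
    """Gram matrix of the Gaussian kernel k(u, v) = exp(-||u - v||^2 / (2 h^2))."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Squared RKHS distance between the empirical mean embeddings of x and y
    (the biased V-statistic estimator of MMD^2)."""
    kxx = gaussian_gram(x, x, bandwidth).mean()
    kyy = gaussian_gram(y, y, bandwidth).mean()
    kxy = gaussian_gram(x, y, bandwidth).mean()
    return float(kxx + kyy - 2.0 * kxy)
```

Because the kernel is bounded, the embeddings always exist; the paper's differentiability argument additionally needs the Sinkhorn functional to be a smooth map of such embeddings.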

What would settle it

A large-sample simulation in which the two counterfactual distributions are identical yet the test rejects the null at a rate exceeding the nominal level would refute the claim of asymptotic validity.
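A toy version of that check, assuming a plain permutation test with a difference-in-means statistic in place of the paper's debiased STE test: draw both arms from the same distribution many times and verify the rejection rate stays near the nominal 5% level.

```python
import numpy as np

def permutation_pvalue(x, y, stat, n_perm=200, rng=None):
    """Two-sample permutation p-value for a statistic where large values
    indicate a difference between the samples."""
    if rng is None:
        rng = np.random.default_rng(0)
    observed = stat(x, y)
    pooled = np.concatenate([x, y])
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        exceed += stat(perm[:n], perm[n:]) >= observed
    return (1 + exceed) / (1 + n_perm)

# Monte Carlo under the null: identical arms, nominal level 0.05
rng = np.random.default_rng(1)
stat = lambda a, b: abs(a.mean() - b.mean())
n_sims, rejections = 200, 0
for _ in range(n_sims):
    x, y = rng.normal(size=30), rng.normal(size=30)
    rejections += permutation_pvalue(x, y, stat, rng=rng) <= 0.05
type_one_error = rejections / n_sims  # should land near 0.05
```

A rate materially above the nominal level at large n would be exactly the refutation this section describes; the permutation statistic here is only a stand-in for the studentized STE statistic.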

Figures

Figures reproduced from arXiv: 2605.08485 by Alex Luedtke, Medha Agarwal.

Figure 1. MMD vs. Sinkhorn divergence across increasing separation θ between the counterfactual outcome distribution under control, P0 = N(0₂, I₂), and under treatment, P1,θ = ½N(−θ1₂, I₂) + ½N(θ1₂, I₂). The average treatment effect is zero for all θ > 0. Here Dθ denotes either the MMD or the Sinkhorn divergence between P0 and P1,θ. As θ increases, the distributions diverge and MMD saturates, failing to distinguish between …
Figure 2. I: Type I error of the MTE and STE under the null (θ = 0.0). II: Power of the MTE and STE under increasing separation between counterfactual distributions (increasing θ) in Exp (i). III: Power of the MTE and STE under increasing separation in Exp (ii). IV: Mean squared error of the plug-in vs. one-step STE for θ = 1.6 in Exp (i). V: Coverage of Wald-type 95% confidence intervals …
Figure 3. Mean and covariance ellipsoids (95%) of counterfactual outcome distributions under a varying gap between P0 and P1, parametrized by θ. Exp (i): mean-difference experiment, P0 = N(0₂, Σ), P1 = N(θ1₂, Σ). Exp (ii): covariance-difference experiment, P0 = N(0₂, Σ), P1 = N(0₂, Σ + θΔ). Simulations for the aggregated test replicate the setup of Exp (ii) but evaluate the tests over a grid of ε v…
Figure 4. Type-I error and power in Exp (ii) for the aggregated procedures MTE-Agg and STE-Agg, together with the corresponding MTE- and STE-based tests evaluated on a finite grid of kernel bandwidth parameters ε = ηm, where m is the median heuristic. PCam dataset: for each unit, covariates are generated as X ∼ N(0, I₅). Conditional …
Figure 5. Type I error (far-left point) and power (all other points) of the STE and MTE as a function of the treatment success probability for the PCam dataset. Compute details: the code was written in Python 3, using PyTorch for automatic differentiation. All experiments were conducted on a CUDA-enabled machine with 12 GB GPU memory, 64 GB RAM, and 24 vCPUs. Although the experiments were run on a GPU, we obse…
Figure 6. Wall-clock runtime (in seconds) and memory usage (in MB) for the simulation setup across increasing sample sizes n, averaged over 20 Monte Carlo simulations, for both GPU and CPU implementations. The source caption also summarizes the computational bottlenecks of the second-order one-step STE estimator and practical acceleration strategies. From Sec. 5, the total computational …
read the original abstract

We introduce the Sinkhorn treatment effect, an entropic optimal transport measure of divergence between counterfactual distributions. Unlike classical quantities such as the average treatment effect, this measure captures differences across entire distributions. We analyze this divergence as a statistical functional and show it can be written as a smooth transformation of counterfactual mean embeddings with an appropriate kernel. This characterization allows us to establish first-order pathwise differentiability in general, and second-order pathwise differentiability under the null hypothesis of equal counterfactual distributions. Leveraging this smoothness, we construct debiased estimators and use them to obtain asymptotically valid tests for distributional treatment effects with a fixed entropic regularization parameter. Because the power of the test depends on this unknown parameter, we further propose an aggregated test that combines evidence across a grid of regularization choices. Experiments on simulated and image data demonstrate the practical advantages of our estimator and testing procedure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the Sinkhorn treatment effect, an entropic optimal transport divergence between counterfactual distributions, as a measure of distributional treatment effects. It characterizes the quantity as a smooth transformation of counterfactual mean embeddings under a suitable kernel, establishes first-order pathwise differentiability in general and second-order pathwise differentiability under the null of equal counterfactual distributions, constructs debiased estimators, and derives asymptotically valid tests for fixed entropic regularization. An aggregated test over a grid of regularization values is proposed to mitigate power dependence on the unknown parameter. The approach is illustrated on simulated data and image data.

Significance. If the differentiability and asymptotic results hold, the work supplies a computationally tractable, kernel-based causal OT functional that enables rigorous inference on full distributional shifts rather than moments alone. The explicit treatment of the regularization parameter via aggregation and the construction of debiased estimators are practical strengths that could support applications in causal machine learning where testing equality of counterfactual laws is required.

major comments (1)
  1. The abstract asserts first- and second-order pathwise differentiability together with asymptotic validity of the debiased estimators, yet the provided text supplies no explicit conditions on the kernel, no error bounds, and no verification steps for the second-order expansion under the null. Without these details the support for the central claims on differentiability and test validity cannot be fully assessed.
minor comments (2)
  1. The dependence of test power on the regularization parameter is acknowledged, but the precise aggregation procedure (weights, grid construction) would benefit from an explicit algorithmic statement.
  2. Notation for the counterfactual mean embeddings and the Sinkhorn divergence should be introduced with a self-contained definition before the differentiability arguments.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We have revised the manuscript to supply the missing explicit conditions, error bounds, and verification steps for the differentiability claims.

read point-by-point responses
  1. Referee: The abstract asserts first- and second-order pathwise differentiability together with asymptotic validity of the debiased estimators, yet the provided text supplies no explicit conditions on the kernel, no error bounds, and no verification steps for the second-order expansion under the null. Without these details the support for the central claims on differentiability and test validity cannot be fully assessed.

    Authors: We agree that the original submission did not provide sufficient explicit conditions or verification details. In the revised manuscript we have added a dedicated subsection (Section 3.2) stating the required kernel assumptions (bounded, continuous, and characteristic kernels with finite RKHS norm), derived explicit first- and second-order pathwise derivative bounds under these conditions, and included a full verification of the second-order expansion under the null (Appendix B.3). These additions directly support the asymptotic validity of the debiased estimators and the proposed tests. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines the Sinkhorn treatment effect directly as an entropic OT divergence on counterfactual distributions, represents it as a smooth functional of mean embeddings via an appropriate kernel, and derives first- and second-order pathwise differentiability from that representation using standard functional analysis. Debiased estimators and asymptotic tests follow from the differentiability, with the fixed regularization parameter explicitly acknowledged and addressed via a separate aggregation proposal. No load-bearing step reduces a claimed result to a fitted input, self-citation chain, or definitional tautology; all steps rest on external OT and statistical functional theory.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on the abstract alone, the central claims rest on the existence of a suitable kernel for mean embeddings and on standard regularity conditions for pathwise differentiability of statistical functionals; no explicit free parameters beyond the fixed regularization strength are named, and no new physical entities are postulated.

axioms (2)
  • domain assumption Existence of an appropriate positive definite kernel that induces the mean embeddings of counterfactual distributions
    Invoked to represent the Sinkhorn divergence as a smooth transformation of mean embeddings
  • domain assumption Standard regularity conditions for pathwise differentiability of the statistical functional
    Required to establish first- and second-order differentiability

pith-pipeline@v0.9.0 · 5449 in / 1449 out tokens · 40492 ms · 2026-05-12T01:25:00.242923+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

219 extracted references · 219 canonical work pages · 1 internal anchor
