The Geometry of Statistical Feature Learning in Mean-Field Langevin Dynamics

Guillaume Lecué; Taiji Suzuki; Tomoya Wakayama; Zong Shang

arxiv: 2606.31429 · v1 · pith:HK27RIZUnew · submitted 2026-06-30 · 🧮 math.ST · stat.TH

Zong Shang, Tomoya Wakayama, Guillaume Lecué, Taiji Suzuki

Pith reviewed 2026-07-01 03:27 UTC · model grok-4.3

classification 🧮 math.ST stat.TH

keywords: mean-field Langevin dynamics · Gaussian multi-index models · Wasserstein gradient flow · negative entropy regularization · feature learning · concentration phenomena · parameter recovery · single-index models

The pith

Spherical mean-field Langevin dynamics concentrate near the hidden indices in Gaussian multi-index models at low temperatures, producing multi-spike stationary distributions that recover parameters with high probability despite negative entropy regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines statistical feature learning geometrically through a base-fiber decomposition in which training produces a feature-side base geometry and a learned fiber space for estimation. It proves this structure holds for spherical mean-field Langevin dynamics interpreted as the Wasserstein gradient flow of negative entropy-regularized empirical risk. In Gaussian multi-index models the low-temperature stationary distribution concentrates near the hidden indices, forms a multi-spike structure, and achieves parameter recovery with high probability, with a sharp transition at temperature λ ≃ 1. In single-index models the stationary measure obeys a Lévy-Milman concentration property whose support depends on parity, and the induced feature space aligns the regression signal to deliver statistical rates of order d/N and Md/N up to logarithmic factors. A reader would care because the result shows how the dynamics can discover relevant directions even though the regularization term penalizes concentration.

Core claim

For spherical mean-field Langevin dynamics viewed as the Wasserstein gradient flow of negative entropy-regularized empirical risk, the low-temperature stationary distribution in Gaussian multi-index models concentrates near the hidden indices, forms a multi-spike structure, and yields parameter recovery with high probability, even though negative entropy regularization penalizes concentration; this concentration exhibits a sharp transition at temperature λ ≃ 1. In Gaussian single-index models the stationary measure satisfies a Lévy-Milman concentration property, with parity determining whether it lives on the sphere S^{d-1} or the projective space RP^{d-1}. The induced learned feature space aligns the regression signal and yields rates d/N and Md/N, up to logarithmic factors.

What carries the argument

The base-fiber decomposition of statistical feature learning, in which the base is the feature-side geometry produced by training and the fiber is the learned feature space where estimation occurs; it links the dynamics directly to the geometry of the stationary measure.

If this is right

Parameter recovery occurs with high probability in multi-index models once temperature drops below the threshold near 1.
The multi-spike structure persists even though negative entropy regularization penalizes concentration.
In single-index models the stationary measure lives on the sphere or the projective space according to parity.
The aligned feature space yields statistical rates of order d/N and Md/N up to logarithmic factors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The base-fiber geometry may extend to other gradient-flow formulations of training beyond the spherical mean-field case.
The sharp temperature threshold could be used to tune regularization strength in practice so that concentration occurs without explicit feature selection.
Parity dependence in single-index models suggests that sign-flip symmetries in the data affect the topology of the learned feature space.

Load-bearing premise

The spherical mean-field Langevin dynamics are exactly the Wasserstein gradient flow of the negative entropy-regularized empirical risk and the data are drawn from a Gaussian multi-index or single-index model.
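
For orientation, the premise can be written out in the standard mean-field Langevin form. This is a sketch under assumed notation, not the paper's own statement: μ is a density on the sphere with respect to the surface measure σ, R_N is the empirical risk as a functional of μ, δR_N/δμ is its first variation, and λ > 0 is the temperature. The paper's normalization and exact regularizer may differ.

```latex
% Assumed notation (not taken from the paper): \mu density on S^{d-1},
% R_N empirical risk, \lambda > 0 temperature.
% Free energy: risk plus \lambda times negative entropy.
\[
  F_\lambda(\mu) \;=\; R_N(\mu) \;+\; \lambda \int_{S^{d-1}} \mu \log \mu \, d\sigma .
\]
% Its Wasserstein gradient flow is a nonlinear Fokker-Planck equation on the sphere.
\[
  \partial_t \mu_t \;=\; \operatorname{div}_{S^{d-1}}\!\Big( \mu_t \, \nabla_{S^{d-1}} \frac{\delta R_N}{\delta \mu}[\mu_t] \Big) \;+\; \lambda \, \Delta_{S^{d-1}} \mu_t .
\]
% A stationary point satisfies a self-consistent Gibbs equation.
\[
  \mu_\lambda(w) \;\propto\; \exp\!\Big( -\tfrac{1}{\lambda} \, \frac{\delta R_N}{\delta \mu}[\mu_\lambda](w) \Big).
\]
```

In this form the tension highlighted by the review is visible: the term ∫ μ log μ grows as μ concentrates, so it penalizes spikes, while a small λ sharpens the Gibbs weight around minimizers of the first variation.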

What would settle it

Simulate the spherical mean-field Langevin dynamics on Gaussian multi-index data and check whether the empirical stationary distribution concentrates in small neighborhoods of the hidden indices when the temperature parameter is below 1 but spreads out when the temperature exceeds 1.
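
A minimal sketch of such a check, under loudly stated assumptions that do not come from the paper: a two-layer mean-field model with the quadratic Hermite activation z² − 1, noise-free labels, hidden indices fixed to the first coordinate axes, and a projected Euler-Maruyama step with renormalization as the sphere discretization. The function names `simulate_spherical_mfld` and `hidden_subspace_mass`, and all default values, are hypothetical. With this even activation only the span of the hidden indices is identifiable (up to sign), so the diagnostic is the average squared projection of particles onto that span; for a single index (M = 1) this is the squared overlap with ±θ. The toy temperature scale need not match the paper's λ ≃ 1 threshold.

```python
import numpy as np


def he2(z):
    """Second Hermite polynomial, used as an (assumed) even activation and link."""
    return z ** 2 - 1.0


def simulate_spherical_mfld(lam, d=10, M=1, N=500, n_particles=200,
                            dt=0.05, n_steps=400, seed=0):
    """Noisy particle dynamics on the unit sphere at temperature `lam`.

    Illustrative toy, not the paper's construction: particles follow
    minus the spherical gradient of the first variation of the empirical
    squared loss, plus tangential Gaussian noise of variance 2*lam*dt.
    """
    rng = np.random.default_rng(seed)
    Theta = np.eye(d)[:M]                       # hidden indices e_1, ..., e_M (assumed)
    X = rng.standard_normal((N, d))             # Gaussian covariates
    y = he2(X @ Theta.T).mean(axis=1)           # noise-free multi-index labels
    W = rng.standard_normal((n_particles, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    for _ in range(n_steps):
        Z = X @ W.T                             # (N, n_particles) pre-activations
        resid = he2(Z).mean(axis=1) - y         # mean-field prediction minus label
        # Euclidean gradient at particle j: (1/N) sum_i resid_i * 2 * z_ij * x_i
        G = (2.0 * Z * resid[:, None]).T @ X / N
        noise = rng.standard_normal(W.shape)
        step = -dt * G + np.sqrt(2.0 * lam * dt) * noise
        # Project the step onto the tangent space, then retract to the sphere.
        step -= np.sum(step * W, axis=1, keepdims=True) * W
        W = W + step
        W /= np.linalg.norm(W, axis=1, keepdims=True)
    return W, Theta


def hidden_subspace_mass(W, Theta):
    """Average squared projection of particles onto span(Theta); equals M/d for uniform particles."""
    return float(np.mean(np.sum((W @ Theta.T) ** 2, axis=1)))


if __name__ == "__main__":
    for lam in (0.01, 0.1, 1.0, 2.0):
        W, Theta = simulate_spherical_mfld(lam)
        print(f"lambda={lam:<5} hidden-subspace mass={hidden_subspace_mass(W, Theta):.3f}")
```

Comparing the printed mass across temperatures gives a crude picture of concentration versus spreading. An activation with higher-order Hermite components would be needed to test spike formation at individual indices when M > 1.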

Original abstract

We introduce a geometric formulation of statistical feature learning for supervised regression. Feature learning is defined through a base-fiber decomposition: the base is the feature-side geometry produced by training, and the fiber is the learned feature space where estimation is performed. We prove this property for spherical mean-field Langevin dynamics, viewed as the Wasserstein gradient flow of a negative entropy-regularized empirical risk. In Gaussian multi-index models, the low-temperature stationary distribution concentrates near the hidden indices, forms a multi-spike structure, and yields parameter recovery with high probability, even though negative entropy regularization penalizes concentration. This concentration has a sharp transition at temperature $\lambda\asymp 1$. In Gaussian single-index models, the stationary measure satisfies a Lévy-Milman concentration property, with parity determining whether it lives on $S_2^{d-1}$ or $\mathbb{RP}^{d-1}$. The induced learned feature space aligns the regression signal and yields rates $d/N$ and $Md/N$, up to logarithmic factors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The base-fiber decomposition and the multi-spike concentration with sharp λ≃1 transition in Gaussian multi-index models are the actual new statements here.

The paper's main contribution is a geometric framing of feature learning via base-fiber decomposition, applied to spherical mean-field Langevin dynamics treated as the Wasserstein gradient flow of negative-entropy regularized risk. In Gaussian multi-index models it claims the low-temperature stationary measure concentrates near the hidden indices, produces a multi-spike structure, and still recovers parameters with high probability despite the regularizer. A sharp transition appears at λ ≃ 1. The single-index case adds a Lévy-Milman concentration property whose support depends on parity.

These statements are presented as new and not reducible to the cited prior work. The rates d/N and Md/N (up to logs) for the induced feature space follow from the alignment with the regression signal. The modeling assumptions are explicit: Gaussian data and the exact identification of the dynamics with the gradient flow.

The central claims rest on standard Wasserstein theory plus Gaussian concentration tools, so the circularity burden looks low. The main soft spot is that the abstract asserts the proofs and error bounds without supplying the derivations or the precise handling of the mean-field limit; that makes it impossible to check whether hidden assumptions affect the sharp transition or the recovery probability. If the full proofs close those gaps cleanly, the geometric language could organize thinking about feature recovery inside an explicit optimization process.

This is for readers already working on mean-field limits or high-dimensional index models in theoretical statistics. It is narrow enough that not every stats group needs it, but the specific concentration results are concrete enough to deserve referee time rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces a geometric formulation of statistical feature learning via a base-fiber decomposition, where the base captures feature-side geometry from training and the fiber is the learned feature space for estimation. It proves that this structure holds for spherical mean-field Langevin dynamics, viewed as the Wasserstein gradient flow of a negative entropy-regularized empirical risk. In Gaussian multi-index models the low-temperature stationary distribution concentrates near the hidden indices, forms a multi-spike structure, and enables high-probability parameter recovery despite the regularization penalty, with a sharp transition at temperature λ ≃ 1. In Gaussian single-index models the stationary measure satisfies a Lévy-Milman concentration property (with parity determining the ambient space), the learned features align the regression signal, and statistical rates of order d/N and Md/N (up to logarithmic factors) are obtained.

Significance. If the concentration, multi-spike structure, and sharp transition results are rigorously established, the work supplies a geometric and optimal-transport perspective on feature learning in mean-field Langevin dynamics that connects statistical recovery rates to Wasserstein gradient flows. The use of standard Gaussian concentration tools together with the claimed parameter-free aspects of the derivations would be a strength; the explicit rates in single- and multi-index settings could inform high-dimensional learning theory.

major comments (2)

[Abstract / Introduction] The abstract asserts proofs of concentration, multi-spike structure, and the sharp transition at λ ≃ 1, yet the provided description contains no explicit theorem statements, error bounds, or handling of the mean-field limit; without these derivations it is impossible to confirm that the modeling assumptions (spherical dynamics exactly matching the Wasserstein flow, Gaussian multi-index data) do not introduce post-hoc gaps that affect the central recovery claims.
[Section introducing base-fiber decomposition] The base-fiber decomposition is presented as the central geometric object, but its precise definition (how the base is extracted from the stationary measure and how the fiber is constructed for estimation) is not visible; this definition is load-bearing for the claim that the framework applies to supervised regression.

minor comments (2)

[Abstract] Clarify whether λ ≃ 1 denotes asymptotic equivalence, a specific numerical threshold, or an order-of-magnitude statement; tie the notation to the precise statement of the transition theorem.
[Results on recovery rates] The rates d/N and Md/N are stated up to logarithmic factors; specify the precise dependence on the number of indices M and any hidden constants in the theorems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and comments. We address each major comment below.

Referee: [Abstract / Introduction] The abstract asserts proofs of concentration, multi-spike structure, and the sharp transition at λ ≃ 1, yet the provided description contains no explicit theorem statements, error bounds, or handling of the mean-field limit; without these derivations it is impossible to confirm that the modeling assumptions (spherical dynamics exactly matching the Wasserstein flow, Gaussian multi-index data) do not introduce post-hoc gaps that affect the central recovery claims.

Authors: The abstract summarizes the main contributions at a high level. Explicit theorem statements appear in the body: Theorem 3.1 establishes the Wasserstein gradient flow property for the spherical dynamics; Theorems 3.2 and 3.4 give the concentration, multi-spike structure, and sharp transition at λ ≃ 1 with explicit high-probability bounds; the mean-field limit is controlled in the proofs of Section 3 via standard propagation-of-chaos arguments. The Gaussian multi-index model is the standing assumption from the outset, and the spherical-to-Wasserstein equivalence is derived directly in Proposition 2.1 without post-hoc adjustments. We can add a short statement of the main theorems to the introduction for clarity.
revision: partial

Referee: [Section introducing base-fiber decomposition] The base-fiber decomposition is presented as the central geometric object, but its precise definition (how the base is extracted from the stationary measure and how the fiber is constructed for estimation) is not visible; this definition is load-bearing for the claim that the framework applies to supervised regression.

Authors: Section 2.1 defines the decomposition: the base is the feature-side geometry given by the marginal of the stationary measure on the sphere (extracted via its support and second-moment matrix), while the fiber is the learned feature space obtained by regressing the labels onto the coordinates aligned with the base. We agree that the extraction and construction steps can be stated more formally and visibly, and will revise Section 2.1 to include an explicit definition with the precise maps from the stationary measure.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper's central claims rest on viewing spherical mean-field Langevin dynamics as the Wasserstein gradient flow of a negative entropy-regularized empirical risk, then applying standard Gaussian concentration and Lévy-Milman tools to analyze the low-temperature stationary measure in multi-index and single-index models. These steps invoke external mathematical frameworks (Wasserstein geometry, concentration inequalities) whose validity does not depend on quantities fitted from the same data or on self-citations whose content reduces to the present results. No equation equates a derived recovery rate or transition threshold to a parameter fit by construction, and the base-fiber decomposition is introduced as a definition rather than smuggled via prior self-work. The derivation therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the identification of the dynamics with a Wasserstein gradient flow, the Gaussian index model assumptions, and standard concentration-of-measure results; no free parameters are fitted inside the proofs, but the temperature λ is a model parameter whose critical value is derived rather than estimated from data.

axioms (2)

domain assumption: Spherical mean-field Langevin dynamics coincide with the Wasserstein gradient flow of the negative-entropy-regularized empirical risk.
Explicitly stated in the abstract as the modeling choice that enables the geometric analysis.
domain assumption: Data are generated from a Gaussian multi-index or single-index model.
Required for the stated concentration and Lévy-Milman properties to hold.

invented entities (1)

base-fiber decomposition (no independent evidence)
purpose: To separate the feature-side geometry produced by training from the learned feature space used for estimation.
Introduced as the central geometric formulation of statistical feature learning.

pith-pipeline@v0.9.1-grok · 5715 in / 1550 out tokens · 55267 ms · 2026-07-01T03:27:08.027154+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 6 canonical work pages · 1 internal anchor

[1] [BAGJ24] Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. Communications on Pure and Applied Mathematics, 77(3):2030–2080.
[2] [BC23] Francis Bach and Lénaïc Chizat. Gradient descent on infinitely wide neural networks: global convergence and generalization. In Proceedings of the International Congress of Mathematicians 2022, pages 5398–5419. EMS Press.
[3] [BÉ85] Dominique Bakry and Michel Émery. Diffusions hypercontractives. In Séminaire de Probabilités XIX 1983/84, volume 1123 of Lecture Notes in Mathematics, pages 177–
[4] [E23] Weinan E. A mathematical perspective of machine learning. In Proceedings of the International Congress of Mathematicians (ICM 2022), volume 2, pages 914–954. EMS Press, December
[5] [GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press.
[6] [GSJW20] Mario Geiger, Stefano Spigler, Arthur Jacot, and Matthieu Wyart. Disentangling feature and lazy training in deep neural networks. Journal of Statistical Mechanics: Theory and Experiment, 2020(11):113301.
[7] [HI25] Qiyang Han and Masaaki Imaizumi. Precise gradient descent training dynamics for finite-width multi-layer neural networks. arXiv preprint arXiv:2505.04898.
[8] [HRŠS21] Kaitong Hu, Zhenjie Ren, David Šiška, and Lukasz Szpruch. Mean-field Langevin dynamics and energy landscape of neural networks. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 57(4):2043–2065.
[9] [Kol06] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656.
[10] [Kol11] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008, volume 2033 of Lecture Notes in Mathematics. Springer, Berlin, Heidelberg.
[11] [LLS25] Guillaume Lecué, Zhifan Li, and Zong Shang. Sharp convergence rates for spectral methods via the feature space decomposition method. arXiv preprint arXiv:2512.14473.
[12] [MB23] Pierre Marion and Raphaël Berthier. Leveraging the two-timescale regime to demonstrate convergence of neural networks. In Advances in Neural Information Processing Systems, volume 36, pages 64996–65029.
[13] [MW26] Andrea Montanari and Zihao Wang. Phase transitions for feature learning in neural networks. arXiv preprint arXiv:2602.01434.
[14] [RVE22] Grant M. Rotskoff and Eric Vanden-Eijnden. Trainability and accuracy of artificial neural networks: An interacting particle system approach. Communications on Pure and Applied Mathematics, 75(9):1889–1935.