pith. machine review for the scientific record.

arXiv:2604.27883 · v1 · submitted 2026-04-30 · 🧮 math.ST · cs.IT · cs.LG · math.IT · stat.ML · stat.TH

Recognition: unknown

Decoupled Descent: Exact Test Error Tracking Via Approximate Message Passing

Max Lovig

Pith reviewed 2026-05-07 06:27 UTC · model grok-4.3

classification 🧮 math.ST · cs.IT · cs.LG · math.IT · stat.ML · stat.TH
keywords decoupled descent · approximate message passing · Gaussian mixture models · generalization gap · train-test identity · state evolution · full-batch gradient descent

The pith

Decoupled descent cancels data reuse biases so that training error asymptotically tracks test error in Gaussian mixture models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces decoupled descent, a training algorithm that uses approximate message passing to remove the systematic biases that arise when the same data is reused in full-batch gradient descent. This produces a train-test identity in which the observed training error converges to the true test error for stylized Gaussian mixture models. A sympathetic reader would care because the approach demonstrates that zero-cost validation and full data utilization are possible in this regime without needing a held-out set. The algorithm is further governed by a low-dimensional state evolution recursion that makes the entire training trajectory transparent and computable.

Core claim

Decoupled descent (DD) is a training algorithm that leverages approximate message passing theory to iteratively cancel the biases induced by data reuse in full-batch gradient descent. For stylized Gaussian mixture models, this yields a train-test identity: the training error asymptotically tracks the test error. The algorithm's dynamics are described by a low-dimensional state evolution recursion. Empirically, DD outperforms standard gradient descent on XOR classification and narrows the generalization gap on noisy MNIST and CIFAR-10 tasks, where the stylized assumptions are relaxed.
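
To make the stylized regime concrete, here is a minimal sketch (not the paper's code) of the setting the claim addresses: a two-component Gaussian mixture with n = d and plain full-batch gradient descent on logistic loss, which exhibits the train/test gap that DD is designed to cancel. The mixture parameterization, the logistic loss, and the values of n, d, snr, eta, and steps are illustrative assumptions, not the paper's exact specification.

```python
# Minimal sketch (assumed setup, not the paper's code): full-batch GD on a
# two-component Gaussian mixture, illustrating the train/test gap DD targets.
import numpy as np

rng = np.random.default_rng(0)
n, d, snr, eta, steps = 1000, 1000, 4.0, 0.05, 200

mu = rng.standard_normal(d) / np.sqrt(d)                    # class-mean direction
y = rng.choice([-1.0, 1.0], size=n)                         # balanced labels
X = snr * np.outer(y, mu) + rng.standard_normal((n, d))     # training features
Xte = snr * np.outer(y, mu) + rng.standard_normal((n, d))   # fresh test features

def logistic_error(theta, X, y):
    # average logistic loss, computed stably as log(1 + exp(-margin))
    return np.mean(np.logaddexp(0.0, -y * (X @ theta)))

theta = np.zeros(d)
for _ in range(steps):
    margins = y * (X @ theta)
    weights = y / (1.0 + np.exp(margins))          # per-sample negative loss derivative
    theta += eta * (X * weights[:, None]).mean(axis=0)

print("train error:", logistic_error(theta, X, y))
print("test  error:", logistic_error(theta, Xte, y))   # same labels, fresh noise
```

In this proportional n ≈ d regime the printed train error typically falls well below the test error; that gap is exactly what the train-test identity is meant to close.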

What carries the argument

The decoupled descent algorithm, which separates parameter updates to enable exact iterative cancellation of data-reuse biases through approximate message passing state evolution.
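
The bias-cancellation mechanism is easiest to see in the classical AMP iteration for a noisy linear model (the Donoho–Maleki–Montanari construction), where a single Onsager term added to the residual removes the correlation created by reusing the measurement matrix across iterations. The sketch below is that generic template, not the paper's decoupled descent update; the denoiser, the sparse linear model, and all parameter values are stand-ins chosen for illustration. DD adapts the same idea to gradient descent on Gaussian mixture models.

```python
# Generic AMP template for y = A x0 + w with a soft-thresholding denoiser.
# The `onsager` line is the data-reuse correction; dropping it leaves a plain
# iterative-thresholding scheme whose residuals are biased by reuse of A.
# Illustration of the mechanism only, not the paper's DD algorithm.
import numpy as np

rng = np.random.default_rng(1)
n, p, sparsity, sigma, alpha, iters = 500, 1000, 0.1, 0.1, 1.5, 30
delta = n / p                                      # aspect ratio

A = rng.standard_normal((n, p)) / np.sqrt(n)       # measurement matrix
x0 = rng.standard_normal(p) * (rng.random(p) < sparsity)
y = A @ x0 + sigma * rng.standard_normal(n)

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

x, z = np.zeros(p), y.copy()
for _ in range(iters):
    tau = np.linalg.norm(z) / np.sqrt(n)           # effective noise level
    x_new = soft_threshold(x + A.T @ z, alpha * tau)
    onsager = (np.mean(x_new != 0.0) / delta) * z  # Onsager correction term
    z = y - A @ x_new + onsager                    # bias-corrected residual
    x = x_new

print("reconstruction MSE:", np.mean((x - x0) ** 2))
```

Without the Onsager term the residual statistics drift in a way that standard state evolution no longer tracks; the paper's claim is that an analogous correction, built into the descent updates themselves, does the same job for reused training data in the GMM setting.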

Load-bearing premise

The data must follow stylized Gaussian mixture models for which approximate message passing theory can exactly cancel all biases arising from data reuse during training.

What would settle it

A simulation on Gaussian mixture data in which decoupled descent is run but the training and test errors fail to converge to the same asymptotic value would falsify the train-test identity.
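
A schematic version of that falsification test, assuming a logistic-loss error metric and an n = d two-component Gaussian mixture, might look like the following. The function `dd_update` is a hypothetical placeholder for any candidate implementation of the paper's algorithm (its exact update rule is not reproduced here); `sample_gmm` and the error metric are assumptions of this sketch.

```python
# Schematic falsification harness (assumptions: logistic error metric, n = d
# two-component Gaussian mixture).  `dd_update` is a hypothetical stand-in for
# a candidate implementation of decoupled descent, supplied by the tester.
import numpy as np

def sample_gmm(n, d, snr, mu, rng):
    y = rng.choice([-1.0, 1.0], size=n)
    X = snr * np.outer(y, mu) + rng.standard_normal((n, d))
    return X, y

def logistic_error(theta, X, y):
    return np.mean(np.logaddexp(0.0, -y * (X @ theta)))

def train_test_gap(dd_update, d, snr=4.0, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal(d) / np.sqrt(d)
    X, y = sample_gmm(d, d, snr, mu, rng)         # training sample, n = d
    Xte, yte = sample_gmm(d, d, snr, mu, rng)     # independent test sample
    theta = dd_update(X, y)                        # candidate DD trainer
    return abs(logistic_error(theta, X, y) - logistic_error(theta, Xte, yte))

# The train-test identity predicts train_test_gap(dd_update, d) -> 0 as d
# grows; a gap that stays bounded away from zero across dimensions and seeds
# would falsify the claim for this model.
```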

Figures

Figures reproduced from arXiv: 2604.27883 by Max Lovig.

Figure 1. Summary statistics for 20 XOR runs (n = d = 1000): GD (left) vs. DD (right) with η = 0.05 and SNR λ = 1 (upper left), λ = 4 (upper right), and λ = 8 (bottom). Blue/red denote train/test error; solid lines are medians, shaded areas are interquartile ranges, and dotted lines show min/max. Low SNR (λ = 1): GD overfits (low train/high test error); DD stabilizes both near log(2), reflecting the non-informative r…
Figure 2. Summary statistics for 50 XOR runs (n = d = 1000) of damped DD (defined in Appendix B.2): η = 0.05, fixed at = 1, λ = 4, and η0 = 1 (left), η0 = 0.9 (middle), and η0 = 0.8 (right). Blue/red colors and line types are equivalent to …
Figure 3. MNIST zeros vs. eights train/test errors (d = 784, n = 800, λ = 30) over 20 replications: GD (left) vs. DD (right) with discrete noise from distribution (5.1). Blue/red denote train/test error; solid lines are medians, shaded areas are IQRs, dotted lines are min/max. We run a two-layer network with hidden layers (L ∈ {3, 9, 27}). The train-test error identity continues to hold as L grows and when the Gauss…
Figure 4. Train/test errors for CIFAR-10 (cats vs. dogs): GD (left) vs. DD (right). Blue/red denote train/test error; solid lines are medians, shaded areas cover the interquartile range, dotted lines show min/max, and the title is the method of whitening (see Section 5.3). DD reduces overfitting effects compared to GD when training a classification head on ResNet-18 embeddings.
Figure 5. Summary statistics for 100 signal-less regression runs (n = 200, d = 800): GD (top) vs. DD (bottom) with η = 0.05. Blue/red denote train/test error; solid lines are medians, shaded areas are interquartile ranges, and dotted lines show min/max. Notice that, by design, the trajectories of the train and test error are identical for pure DD, while they significantly diverge for GD.
Figure 6. An example of hyperparameter tuning with DD over 100 individual replications; each panel shows the train and test error of the damped pure DD algorithm in the signal-less regression problem with n = 800, d = 800, η = 0.05, sweeping c over the sub-space given in (B.1). Blue lines refer to train error and red lines to test error; the solid line is the median error, the shaded region is the range from …
Figure 7. Empirical density estimates for runtime on a width-9 two-layer neural network (25 replications, 500 epochs). Left: distribution of per-epoch clock time (ms) for GD vs. DD. Right: total training time (s) per replication for the MNIST zeros/eights problem.
Original abstract

In modern parametric model training, full-batch gradient descent (and its variants) suffers due to progressively stronger biasing towards the exact realization of training data; this drives the systematic "generalization gap", where the train error becomes an unreliable proxy for test error. Existing approaches either argue this gap is benign through complex analysis or sacrifice data to a validation set. In contrast, we introduce decoupled descent (DD), a novel theory-based training algorithm that satisfies a train-test identity -- enforcing the train error to asymptotically track the test error for stylized Gaussian mixture models. Within this specific regime, leveraging approximate message passing theory, DD iteratively cancels the biases due to data reuse, rigorously demonstrating the feasibility of zero-cost validation and 100% data utilization. Moreover, DD is governed by a low-dimensional state evolution recursion, rendering the dynamics of the algorithm transparent and tractable. We validate DD on XOR classification, yielding superior performance compared to GD; additionally, we implement noisy MNIST and non-linear probing of CIFAR-10, demonstrating that even when our stylized assumptions are relaxed, DD narrows the generalization gap compared to GD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces decoupled descent (DD), a novel training algorithm that leverages approximate message passing (AMP) theory to iteratively cancel biases arising from data reuse in gradient descent. For stylized Gaussian mixture models, DD is claimed to enforce an asymptotic train-test error identity, governed by a low-dimensional state evolution recursion that renders the dynamics tractable. Empirical results on XOR classification, noisy MNIST, and non-linear probing of CIFAR-10 are presented to show that DD narrows the generalization gap relative to standard GD, even when the GMM assumptions are relaxed.

Significance. If the central claim holds, the work offers a theoretically grounded approach to zero-cost validation and full data utilization in high-dimensional parametric training, with the AMP-derived state evolution providing a transparent, low-dimensional description of the dynamics. The grounding in established AMP literature is a strength, as it supplies independent justification for the recursions rather than ad-hoc fitting. The empirical narrowing of the gap on real data suggests broader applicability, though the primary value lies in the stylized regime where the identity is rigorously derived.

major comments (2)
  1. [§4] §4, the state evolution derivation: the claim that the train-error recursion is identical to the test-error recursion after each DD iteration requires explicit verification that the Onsager correction exactly equates the effective noise variances in the reused-training view and the independent-test view. Without this matching shown for the GMM loss and gradient, the asymptotic identity does not follow from standard AMP assumptions.
  2. [§3.2] §3.2, the decoupled update rule: the construction introduces model-dependent correlations between the gradient steps and the data; it is not shown that these correlations vanish in the high-dimensional limit with fixed aspect ratio, which is necessary to preserve the independence assumptions underlying the closed state evolution.
minor comments (2)
  1. [Experimental section] The experimental section should report the precise procedure used to initialize or fit the state-evolution parameters on the real-data tasks (noisy MNIST, CIFAR-10), including any sensitivity analysis.
  2. [Figures] Figure captions for the XOR and MNIST results should explicitly state the number of independent runs and the precise definition of the plotted error quantities (e.g., whether they are averaged over the state-evolution trajectory).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive major comments. These points highlight areas where additional explicit verification will strengthen the rigor of the derivations. We address each comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§4] §4, the state evolution derivation: the claim that the train-error recursion is identical to the test-error recursion after each DD iteration requires explicit verification that the Onsager correction exactly equates the effective noise variances in the reused-training view and the independent-test view. Without this matching shown for the GMM loss and gradient, the asymptotic identity does not follow from standard AMP assumptions.

    Authors: We agree that an explicit verification of the noise-variance matching is required to complete the argument. In the revised manuscript we will insert a dedicated calculation in §4 that computes the effective noise variances for the GMM loss and gradient under both the reused-training and independent-test views. We will show that the Onsager correction term arising from the decoupled descent update exactly equates these variances, thereby confirming that the train-error and test-error recursions coincide asymptotically under the standard AMP assumptions. A schematic numerical check of this kind of variance matching is sketched after these responses. revision: yes

  2. Referee: [§3.2] §3.2, the decoupled update rule: the construction introduces model-dependent correlations between the gradient steps and the data; it is not shown that these correlations vanish in the high-dimensional limit with fixed aspect ratio, which is necessary to preserve the independence assumptions underlying the closed state evolution.

    Authors: We acknowledge that the vanishing of the model-dependent correlations introduced by the decoupled update was not shown explicitly. In the revision we will add a short lemma (or remark) in §3.2 that invokes standard AMP concentration and self-averaging results to prove that, in the high-dimensional limit with fixed aspect ratio, these correlations become o(1) and therefore do not affect the independence assumptions required for the closed state evolution. revision: yes
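
As a purely illustrative companion to the two responses above (the paper's proposed verification is analytic, not numerical), one can track, for any iterative trainer, a statistic computed on the reused training data against the same statistic on an independent draw from the same mixture; under the claimed identity the two traces should merge in high dimension, while a persistent offset would signal unmatched effective noise variances or non-vanishing data/iterate correlations. The helper names `train_step`, `variance_match_trace`, and the choice of statistic (second moment of the margins) are assumptions of this sketch, not the paper's construction.

```python
# Illustrative diagnostic (assumed statistic, not the paper's analysis): compare
# a per-iteration statistic of the iterate on reused training data vs. on an
# independent test draw.  `train_step` is any one-step trainer (GD, DD, ...).
import numpy as np

def margin_second_moment(theta, X):
    return float(np.mean((X @ theta) ** 2))

def variance_match_trace(train_step, X, y, Xte, theta0, iters):
    """Returns an (iters, 2) array of (train-view, test-view) statistics."""
    theta, trace = theta0.copy(), []
    for _ in range(iters):
        theta = train_step(theta, X, y)
        trace.append((margin_second_moment(theta, X),     # reused-training view
                      margin_second_moment(theta, Xte)))  # independent-test view
    return np.array(trace)
```

For plain GD the training-view trace is expected to drift away from the test-view trace as the iterate aligns with the realized training noise; the claim under review is that DD's built-in corrections keep the two views asymptotically equal.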

Circularity Check

0 steps flagged

No significant circularity; central claim follows from external AMP theory applied to a constructed algorithm

full rationale

The paper defines decoupled descent (DD) explicitly to cancel data-reuse biases using the Onsager correction terms and state-evolution recursions supplied by established approximate message passing (AMP) literature. The train-test identity is then shown to hold asymptotically because the effective noise variances and correlation terms in the low-dimensional state evolution match between the adjusted training iterates and an independent test view; this matching is a derived consequence of the AMP analysis rather than an input assumption or a fitted parameter renamed as a prediction. No load-bearing step reduces to a self-citation chain, self-definition, or ansatz smuggled from the authors' prior work. The derivation therefore remains self-contained once the external AMP results (which are independently verifiable and not dependent on the present paper's target identity) are granted.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the stylized Gaussian mixture assumption and the validity of AMP analysis for this training procedure; no new physical entities are postulated.

free parameters (1)
  • state evolution parameters
    The low-dimensional recursion likely depends on distribution-specific quantities that may be derived from, or set by, the assumed data model; a generic scalar example of such a recursion is sketched after this ledger.
axioms (1)
  • domain assumption: Approximate message passing theory accurately describes the iterative dynamics of decoupled descent on Gaussian mixture models.
    Invoked to derive the train-test identity and bias cancellation.
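
For concreteness, the sketch below shows the textbook scalar state-evolution recursion for soft-thresholding AMP on a sparse linear model: a single effective-noise parameter τ² updated by a deterministic map, evaluated here by Monte Carlo. The Bernoulli-Gaussian prior and every parameter value are assumptions of this illustration; the paper's recursion for Gaussian-mixture decoupled descent is a different, model-specific map. The example only shows what it means for an entire training trajectory to be governed by a few such parameters.

```python
# Textbook scalar state evolution for soft-thresholding AMP (concept
# illustration only; the paper's GMM recursion is a different map).
import numpy as np

rng = np.random.default_rng(2)
delta, sigma, sparsity, alpha, iters, mc = 0.5, 0.1, 0.1, 1.5, 15, 200_000

# Monte Carlo samples of the scalar signal prior (Bernoulli-Gaussian) and noise
x0 = rng.standard_normal(mc) * (rng.random(mc) < sparsity)
z = rng.standard_normal(mc)

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

tau2 = sigma ** 2 + np.mean(x0 ** 2) / delta           # initialization
for t in range(iters):
    tau = np.sqrt(tau2)
    mse = np.mean((soft_threshold(x0 + tau * z, alpha * tau) - x0) ** 2)
    tau2 = sigma ** 2 + mse / delta                    # state-evolution update
    print(f"iteration {t:2d}: effective noise variance tau^2 = {tau2:.4f}")
```

The fixed point of this recursion predicts the asymptotic reconstruction error of the corresponding AMP algorithm; analogous, Gaussian-mixture-specific quantities play the role of the "state evolution parameters" listed above.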

pith-pipeline@v0.9.0 · 5496 in / 1377 out tokens · 117258 ms · 2026-05-07T06:27:25.573312+00:00 · methodology

