Bayesian Inference with Shaped Deep Non-linear MLPs
Pith reviewed 2026-06-28 20:39 UTC · model grok-4.3
The pith
To first order in LP/N, Bayesian inference in deep non-linear MLPs reduces to a data-dependent kernel method, with a criterion for when depth raises model evidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the regime where LP/N = Θ(1), the Bayesian predictive posterior of deep non-linear MLPs is, to first order in LP/N, equivalent to that of a data-dependent kernel method. The paper also gives a criterion on the data-generating process that determines whether increasing LP/N raises the Bayesian model evidence. The results cover smooth and ReLU activations at arbitrary temperature.
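For orientation, the following is the textbook form of a kernel-method predictive posterior, not the paper's own formula. It assumes a squared loss, so that the temperature $T$ plays the role of an observation-noise variance. The symbols $K$ (the $P \times P$ kernel matrix on the training inputs), $k_*$ (the vector of kernel values between a test input $x_*$ and the training inputs), and $y$ (the training targets) are introduced here for illustration. In the paper's setting the kernel would be the data-dependent one, and its temperature convention may differ.

```latex
\begin{aligned}
\mu(x_*)      &= k_*^\top \left(K + T I_P\right)^{-1} y, \\
\sigma^2(x_*) &= k(x_*, x_*) - k_*^\top \left(K + T I_P\right)^{-1} k_* .
\end{aligned}
```

Read this way, "equivalent to a kernel method" means the network's posterior mean and variance take this form for some kernel that itself depends on the training data.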
What carries the argument
The first-order expansion in the effective depth LP/N of the Neural Covariance SDE, which governs the layer-by-layer evolution of activation covariances.
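A minimal sketch of what "layer-by-layer evolution of activation covariances" can look like numerically at initialization. It assumes the forward pass $z_1 = W_0 x/\sqrt{N_0}$, $z_{\ell+1} = \sqrt{c/N}\, W_\ell \phi_s(z_\ell)$ with i.i.d. standard Gaussian weights, and a leaky-ReLU-type shaped activation with slopes $1 + c_\pm/\sqrt{N}$. The function name `covariance_path` and the defaults `c_plus`, `c_minus` are hypothetical choices for this illustration, not taken from the paper.

```python
import numpy as np


def covariance_path(x, width, depth, c_plus=0.0, c_minus=-1.0, seed=0):
    """Track the Gram matrices z_l z_l^T / N through a random shaped MLP.

    x has shape (P, N0): P inputs of dimension N0.
    Returns an array of shape (depth + 1, P, P).
    """
    rng = np.random.default_rng(seed)
    input_dim = x.shape[1]

    # Assumed shaping: slopes tend to 1 as the width grows.
    s_plus = 1.0 + c_plus / np.sqrt(width)
    s_minus = 1.0 + c_minus / np.sqrt(width)
    # Normalisation c chosen so that 1/c = E[phi_s(g)^2] for g ~ N(0, 1).
    c = 2.0 / (s_plus**2 + s_minus**2)

    w0 = rng.standard_normal((width, input_dim))
    z = x @ w0.T / np.sqrt(input_dim)  # first-layer pre-activations, shape (P, N)
    grams = [z @ z.T / width]
    for _ in range(depth):
        phi = s_plus * np.maximum(z, 0.0) + s_minus * np.minimum(z, 0.0)
        w = rng.standard_normal((width, width))
        z = np.sqrt(c / width) * phi @ w.T
        grams.append(z @ z.T / width)
    return np.stack(grams)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    inputs = rng.standard_normal((3, 10))
    path = covariance_path(inputs, width=256, depth=8)
    print(path.shape)  # one P x P Gram matrix per layer
```

With depth comparable to width, the Gram matrices fluctuate from layer to layer instead of following a deterministic recursion, which is the randomness an SDE description is meant to capture.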
Load-bearing premise
The Neural Covariance SDE framework continues to describe network behavior accurately when P, N, and L grow large together with LP/N held at order one.
What would settle it
Compute the Bayesian model evidence numerically for MLPs at several values of LP/N and check whether its change with effective depth matches the sign predicted by the data criterion.
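As a hedged starting point for such a check, the sketch below computes the exact log evidence of a Gaussian model with a fixed kernel, $y \sim \mathcal{N}(0, K + T I)$. This is only a kernel-level baseline: the evidence of a finite Bayesian MLP would need sampling or the paper's expansion. The names `log_evidence` and `evidence_by_depth`, and the idea of passing one kernel per effective depth, are assumptions made for this illustration.

```python
import numpy as np


def log_evidence(kernel, y, temperature):
    """Log marginal likelihood of y under N(0, K + T I), via Cholesky."""
    n = y.shape[0]
    chol = np.linalg.cholesky(kernel + temperature * np.eye(n))
    alpha = np.linalg.solve(chol, y)  # chol^{-1} y, so alpha @ alpha = y^T (K + T I)^{-1} y
    return (
        -0.5 * alpha @ alpha
        - np.log(np.diag(chol)).sum()  # equals -0.5 * log det(K + T I)
        - 0.5 * n * np.log(2.0 * np.pi)
    )


def evidence_by_depth(kernels, y, temperature):
    """Evaluate the log evidence for a sequence of kernels, one per effective depth."""
    return np.array([log_evidence(k, y, temperature) for k in kernels])
```

Given kernels ordered by increasing LP/N, `np.diff(evidence_by_depth(kernels, y, temperature))` gives the sign of the evidence change that would be compared against the criterion.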
Original abstract
A central aim of deep learning theory is to characterize how neural networks make predictions in the regime of simultaneously large model and training set size. Since the limits of diverging number of model parameters and dataset size do not commute it is not clear a priori what limits exist. In this work, we shed new light on these questions by studying Bayesian inference in deep non-linear MLPs in the regime where the number of training samples ($P$), the input dimension ($N_0$), the hidden layer width ($N$), and the number of hidden layers ($L$) can all be large. We build on the Neural Covariance SDE (Li et al., 2022) to analyze predictive posteriors in the regime where $LP/N\in\Theta(1)$, playing the role of an effective network depth. Our framework covers both smooth and ReLU activation functions and applies to arbitrary temperature. We find to first order in $LP/N$ a simple criterion for which data generating processes benefit from depth in the sense that larger $LP/N$ increases the Bayesian model evidence. We also give a novel derivation of a prior result from the physics literature that at least to first order in $LP/N$, the Bayesian predictive posterior is remarkably simple and is simply equivalent to that of a data-dependent kernel method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes Bayesian inference for deep non-linear MLPs (smooth and ReLU activations) in the joint large-P, large-N, large-L regime with LP/N = Θ(1) as an effective depth parameter. Building on the Neural Covariance SDE of Li et al. (2022), it derives to first order in LP/N a criterion identifying data-generating processes for which increasing LP/N raises the Bayesian model evidence, and shows that the predictive posterior is equivalent to that of a data-dependent kernel method. The framework applies at arbitrary temperature.
Significance. If the first-order results hold, the work supplies a concrete, testable criterion for when depth improves Bayesian evidence and a simplification of the predictive posterior to kernel form, with a novel derivation of a prior physics result. The extension to ReLU activations and general temperature is a strength. The analysis is grounded in an explicit small-parameter expansion rather than heuristic limits.
major comments (1)
- [Introduction / Neural Covariance SDE application] The extension of the Neural Covariance SDE (Li et al., 2022) to the joint L,P,N→∞ limit with LP/N=Θ(1) is invoked without a new convergence argument or error bound (see the setup paragraph citing Li et al. and the subsequent derivation of the O(LP/N) expansion). This assumption is load-bearing for both the sign of the depth correction to the model evidence and the claimed equivalence of the predictive posterior to a data-dependent kernel method; an O(1) correction in the joint limit would alter the leading-order claims.
minor comments (2)
- [Introduction] Define the effective depth parameter LP/N explicitly in the first paragraph of the introduction rather than deferring to the abstract.
- [Main derivation] Clarify whether the first-order truncation is uniform in the activation function (smooth vs. ReLU) or requires separate error estimates.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback on the manuscript. We respond point-by-point to the major comment below.
- Referee: [Introduction / Neural Covariance SDE application] The extension of the Neural Covariance SDE (Li et al., 2022) to the joint L,P,N→∞ limit with LP/N=Θ(1) is invoked without a new convergence argument or error bound (see the setup paragraph citing Li et al. and the subsequent derivation of the O(LP/N) expansion). This assumption is load-bearing for both the sign of the depth correction to the model evidence and the claimed equivalence of the predictive posterior to a data-dependent kernel method; an O(1) correction in the joint limit would alter the leading-order claims.
  Authors: We appreciate the referee pointing out that the joint limit with LP/N=Θ(1) is invoked by direct application of the Neural Covariance SDE from Li et al. (2022) without a fresh convergence proof or explicit error bound in the present manuscript. The setup and O(LP/N) expansion indeed rely on the SDE limit established in the cited work (under large-width assumptions) and then perform a perturbative expansion treating LP/N as the small parameter. We agree this assumption is load-bearing: an uncontrolled O(1) correction would invalidate the leading-order sign of the depth correction to the evidence and the kernel equivalence. Our contribution is the first-order perturbative analysis rather than a new rigorous limit theorem. To address the concern we will revise the manuscript to (i) restate the precise assumptions inherited from Li et al., (ii) explicitly note that all claims are to first order in LP/N with higher-order terms neglected, and (iii) add a short remark that a full joint-limit convergence analysis remains open. This makes the scope of the results transparent without altering the derivations.
  Revision: partial
Circularity Check
Central claims depend on an unverified extension of the Neural Covariance SDE to the joint L, P, N limit with LP/N = Θ(1).
specific steps
- self-citation, load-bearing [Abstract]
"We build on the Neural Covariance SDE (Li et al., 2022) to analyze predictive posteriors in the regime where LP/N∈Θ(1), playing the role of an effective network depth. ... We find to first order in LP/N a simple criterion for which data generating processes benefit from depth in the sense that larger LP/N increases the Bayesian model evidence. We also give a novel derivation of a prior result from the physics literature that at least to first order in LP/N, the Bayesian predictive posterior is remarkably simple and is simply equivalent to that of a data-dependent kernel method."
The criterion and kernel equivalence are obtained by extending the cited SDE to the joint limit with LP/N fixed at order 1. No new justification is provided for why the SDE approximation holds without O(1) corrections when L scales with P/N; the results therefore reduce to the prior framework's validity in this regime.
full rationale
The paper's strongest claims—a criterion for depth benefiting model evidence and equivalence of the predictive posterior to a data-dependent kernel—are derived to first order in LP/N by invoking the Neural Covariance SDE framework. The analysis explicitly assumes this SDE remains valid in the simultaneous large-P, N, L regime without supplying an independent convergence argument or error bound for the joint limit. This matches the moderate circularity pattern of a load-bearing self-citation (or prior framework by overlapping authors) whose applicability to the new scaling is taken as given rather than re-derived or bounded. The central claims retain some independent content in the first-order expansion and the novel derivation step, preventing a higher score.
Reference graph
Works this paper leans on
-
[1]
High-dimensional dynamics of generalization error in neural networks
arXiv:1710.03667 [stat.ML]. url: https://arxiv.org/abs/1710.03667. [Bas+25] Federico Bassetti et al. Feature learning in finite-width Bayesian deep linear networks with multiple outputs and convolutional layers
-
[2]
Reconciling modern machine-learning practice and the classical bias–variance trade-off
arXiv:2406.03260 [stat.ML]. url: https://arxiv.org/abs/2406.03260. [Bel+19] Mikhail Belkin et al. “Reconciling modern machine-learning practice and the classical bias–variance trade-off”. In: Proceedings of the National Academy of Sciences 116.32 (July 2019), pp. 15849–15854. issn: 1091-6490. doi: 10.1073/pnas.1903070116. url: http://dx.doi.org/10.1073/pnas....
-
[3]
arXiv:2411.15267 [stat.ML]. url: https://arxiv.org/abs/2411.15267. [Blu+15] Charles Blundell et al. Weight Uncertainty in Neural Networks
-
[4]
Weight Uncertainty in Neural Networks
arXiv:1505.05424 [stat.ML]. url: https://arxiv.org/abs/1505.05424. [BP22] Lucas Benigni and Sandrine Péché. Largest Eigenvalues of the Conjugate Kernel of Single-Layered Neural Networks
-
[5]
arXiv:2201.04753 [math.PR]. url: https://arxiv.org/abs/2201.04753. [Cam+25] Francesco Camilli et al. Information-theoretic reduction of deep neural networks to linear models in the overparametrized proportional regime
-
[6]
Deterministic equivalent of the conjugate kernel matrix associated to artificial neural networks
arXiv:2505.03577 [math.ST]. url: https://arxiv.org/abs/2505.03577. [Cho23] Clément Chouard. “Deterministic equivalent of the conjugate kernel matrix associated to artificial neural networks”. In: arXiv preprint arXiv:2306.05850 (2023). [COB19] Lenaic Chizat, Edouard Oyallon, and Francis Bach. “On lazy training in differentiable programming”. In: Advances...
-
[7]
Nonlinear Approximation and (Deep) ReLU Networks
arXiv:1905.02199 [cs.LG]. url: https://arxiv.org/abs/1905.02199. [Du+19] Simon S. Du et al. Gradient Descent Provably Optimizes Over-parameterized Neural Networks
-
[8]
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
arXiv:1810.02054 [cs.LG]. url: https://arxiv.org/abs/1810.02054. [El 10] Noureddine El Karoui. “The spectrum of kernel random matrices”. In: (2010). [FW20] Zhou Fan and Zhichao Wang. “Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks”. In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle et al. Vol
-
[9]
Curran Associates, Inc., 2020, pp. 7710–7721. url: https://proceedings.neurips.cc/paper_files/paper/2020/file/572201a4497b0b9f02d4f279b09ec30d-Paper.pdf. [GG16] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
2020
-
[10]
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
arXiv:1506.02142 [stat.ML]. url: https://arxiv.org/abs/1506.02142. [Han19] Boris Hanin. “Universal function approximation by deep neural nets with bounded width and relu activations”. In: Mathematics 7.10 (2019), p
-
[11]
Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies
[Han24] Boris Hanin. “Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies”. In: Journal of Machine Learning Research 25.267 (2024), pp. 1–58. url: http://jmlr.org/papers/v25/23-0643.html. [HDR19] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. “Training dynamics of deep networks using stochastic gradient descent via neural tan...
2024
-
[12]
Classifying Calabi–Yau Threefolds Using Infinite Distance Limits,
arXiv:2503.07872 [math.PR]. url: https://arxiv.org/abs/2503.07872. [HN19a] Boris Hanin and Mihai Nica. “Finite depth and width corrections to the neural tangent kernel”. In: arXiv preprint arXiv:1909.05989 (2019). [HN19b] Boris Hanin and Mihai Nica. “Products of Many Large Random Matrices and Gradients in Deep Neural Networks”. In: Communications in Mathemat-...
-
[13]
Deep neural networks as gaussian processes
arXiv:1806.07572 [cs.LG]. url: https://arxiv.org/abs/1806.07572. [Lee+17] Jaehoon Lee et al. “Deep neural networks as gaussian processes”. In: arXiv preprint arXiv:1711.00165 (2017). [Li+26] Mufan Li et al. Geometric Dyson Brownian Motions and the Free Log-Normal Limit for a Non-Square Product of Random Matrices
-
[14]
The neural covariance SDE: Shaped infinite depth-and-width networks at initialization
arXiv:2310.12079 [stat.ML]. url: https://arxiv.org/abs/2310.12079. [LNR22] Mufan Li, Mihai Nica, and Dan Roy. “The neural covariance SDE: Shaped infinite depth-and-width networks at initialization”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 10795–10808. [LS21] Qianyi Li and Haim Sompolinsky. “Statistical Mechanics of Deep Line...
-
[15]
A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit
arXiv:2306.17759 [stat.ML]. url: https://arxiv.org/abs/2306.17759. [Pac+23] R. Pacelli et al. “A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit”. In: Nature Machine Intelligence 5.12 (Dec. 2023), pp. 1497–1507. issn: 2522-5839. doi: 10.1038/s42256-023-00767-6. url: http://dx.doi.org/10.1038/s42256-023-...
-
[16]
Exponential expressivity in deep neural networks through transient chaos
arXiv:1606.05340 [stat.ML]. url: https://arxiv.org/abs/1606.05340. [PSW13] Nicholas G Polson, James G Scott, and Jesse Windle. “Bayesian inference for logistic models using Pólya–Gamma latent variables”. In: Journal of the American Statistical Association 108.504 (2013), pp. 1339–1349. [PW17] Jeffrey Pennington and Pratik Worah. “Nonlinear random matrix ...
-
[17]
arXiv:1611.01232 [stat.ML]. url: https://arxiv.org/abs/1611.01232. [SCS88] H. Sompolinsky, A. Crisanti, and H. J. Sommers. “Chaos in Random Neural Networks”. In: Phys. Rev. Lett. 61 (3 July 1988), pp. 259–262. doi: 10.1103/PhysRevLett.61.259. url: https://link.aps.org/doi/10.1103/PhysRevLett.61.259. [SNR23] Inbar Seroussi, Gadi Naveh, and Zohar Ringel. “Sepa...
-
[18]
Mean field analysis of neural networks: A law of large numbers
[SS20] Justin Sirignano and Konstantinos Spiliopoulos. “Mean field analysis of neural networks: A law of large numbers”. In: SIAM Journal on Applied Mathematics 80.2 (2020), pp. 725–752. [Tre23] Dario Trevisan. Wide Deep Neural Networks with Gaussian Weights are Very Close to Gaussian Processes
2020
-
[19]
Computing with Infinite Networks
arXiv:2312.11737 [math.ST]. url: https://arxiv.org/abs/2312.11737. [Wil96] Christopher Williams. “Computing with Infinite Networks”. In: Advances in Neural Information Processing Systems. Ed. by M.C. Mozer, M. Jordan, and T. Petsche. Vol
-
[20]
MIT Press, 1996. url: https://proceedings.neurips.cc/paper_files/paper/1996/file/ae5e3ce40e0404a45ecacaaf05e5f735-Paper.pdf. [WWF24] Zhichao Wang, Denny Wu, and Zhou Fan. Nonlinear spiked covariance matrices and signal propagation in deep neural networks
1996
-
[21]
Tensor Programs III: Neural Matrix Laws
arXiv:2402.10127 [stat.ML]. url: https://arxiv.org/abs/2402.10127. [Yan21] Greg Yang. Tensor Programs III: Neural Matrix Laws
-
[22]
Deep learning without shortcuts: Shaping the kernel with tailored rectifiers
arXiv:2009.10685 [cs.NE]. url: https://arxiv.org/abs/2009.10685. [ZBM22] Guodong Zhang, Aleksandar Botev, and James Martens. “Deep learning without shortcuts: Shaping the kernel with tailored rectifiers”. In: arXiv preprint arXiv:2203.08120 (2022). [Zha+17] Chiyuan Zhang et al. Understanding deep learning requires rethinking generalization
-
[23]
Understanding deep learning requires rethinking generalization
arXiv:1611.03530 [cs.LG]. url: https://arxiv.org/abs/1611.03530. A Computing the prior: Neural Covariance SDE. Let us first lay out the foundations of NSDE as we need. Recall again the forward pass (1): $z_1 = \frac{1}{\sqrt{N_0}} W_0 x$, $\phi_\ell = \phi_s(z_\ell)$, $z_{\ell+1} = \sqrt{\frac{c}{N}}\, W_\ell \phi_\ell$, $y = z_{\mathrm{out}} = \sqrt{\frac{c}{N}}\, W_{\mathrm{out}} \phi_L \in \mathbb{R}$. Consider network weights at initialization with shaped activation $\phi_s$ in (1) on ...