pith. machine review for the scientific record.

arxiv: 2605.05113 · v1 · submitted 2026-05-06 · 💻 cs.LG

Recognition: 3 theorem links · Lean Theorem

How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

Mariia Seleznova

Pith reviewed 2026-05-08 17:48 UTC · model grok-4.3

classification 💻 cs.LG
keywords signal propagation · finite-width effects · linear recurrent models · infinite-width limit · scaling regimes · initialization stability · recurrent depth

The pith

Infinite-width theory for signal propagation in linear recurrences holds only up to depth scaling as the square root of width.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper derives exact finite-width formulas for the energies of hidden states in linear recurrent models initialized with complex Gaussians. It identifies three joint scaling regimes between recurrent depth t and width n that determine when the infinite-width approximation remains accurate. The subcritical regime t much less than sqrt(n) preserves the approximation, the critical regime t proportional to sqrt(n) introduces noticeable deviations, and the supercritical regime t much larger than sqrt(n) is dominated by finite-width effects. A sympathetic reader would care because this pinpoints exactly when common initializations like Glorot become unstable for long sequences in recurrent models. The findings also indicate that finite-width effects build up faster with depth in recurrent architectures than in feedforward networks.

Core claim

Exact finite-width formulas for hidden state signal energies reveal that infinite-width predictions remain valid in the subcritical regime where recurrent depth grows slower than the square root of width, that nontrivial deviations emerge in the critical regime at depths proportional to sqrt(width), and that finite-width effects dominate in the supercritical regime. This establishes the precise depth scale at which infinite-width theory breaks down and standard initializations become unstable in long-range linear recurrences, with finite-width corrections accumulating more rapidly than in feedforward models.

What carries the argument

Exact finite-width formulas for hidden state signal energies under complex Gaussian initialization, which identify the subcritical, critical, and supercritical joint depth-width scaling regimes for signal propagation.
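
A minimal sketch of the object in play, under this review's assumptions rather than the paper's own definitions: a linear recurrence h_k = W h_{k-1} with an n × n matrix of i.i.d. complex Gaussian entries of variance 1/n, and a depth-k energy taken as the normalized trace moment tr((W^k)* W^k)/n. The paper's exact energies S_{n,t} and Q_{n,k} are defined in its equation (4) and may normalize differently.

    import numpy as np

    # Sketch setup (assumed normalization, not the paper's eq. (4)): entries ~ CN(0, 1/n),
    # so the infinite-width prediction for the normalized energy is 1 at every depth.
    def complex_gaussian_matrix(n, rng):
        return (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) * np.sqrt(0.5 / n)

    def signal_energy(W, k):
        # tr((W^k)* W^k) / n  =  ||W^k||_F^2 / n for one draw of W.
        M = np.linalg.matrix_power(W, k)
        return np.linalg.norm(M, "fro") ** 2 / W.shape[0]

    rng = np.random.default_rng(0)
    W = complex_gaussian_matrix(256, rng)
    for k in (1, 4, 16, 64):
        print(k, round(signal_energy(W, k), 3))

Averaging signal_energy over many independent draws of W is the quantity the paper's exact formulas give in closed form; single-draw values already drift away from 1 as k grows at fixed n.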

If this is right

  • In the subcritical regime where t = o(sqrt(n)), the infinite-width approximation for signal propagation remains valid.
  • In the critical regime where t ~ c sqrt(n), non-negligible deviations from infinite-width predictions emerge.
  • In the supercritical regime where t >> sqrt(n), finite-width effects dominate signal propagation.
  • Standard initialization schemes such as Glorot become unstable beyond the critical depth scale (see the worked numbers after this list).
  • Finite-width effects accumulate more rapidly with depth in recurrent models than in feedforward models.
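
To make the regime boundaries concrete, here are the critical depth scales √n for a few common hidden widths. This is an illustrative calculation only; the constant c in t ∼ c√n is left unspecified by the bullets above.

    import math

    # Illustrative depth budgets implied by the bullets above (constant factors omitted).
    for n in (256, 1024, 4096, 16384):
        crit = int(math.sqrt(n))
        print(f"width n={n:6d}: critical depth ~ sqrt(n) = {crit:4d} steps; "
              f"subcritical for t << {crit}, supercritical for t >> {crit}")

Even at width 16,384 the critical scale is only about 128 steps, far shorter than the sequence lengths at which modern recurrent models operate, which is what makes the subcritical guarantee so restrictive.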

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practical recurrent neural networks with nonlinear activations may exhibit similar scaling breakdowns, warranting numerical checks at depth around sqrt(width).
  • The critical scaling could guide the design of width-dependent initialization methods for long-sequence models.
  • Extending the analysis to real-valued weights might shift the critical exponent away from square root.
  • These regimes suggest that mean-field limits for recurrent networks require careful depth-width balancing not needed in feedforward cases.

Load-bearing premise

The derivation relies on linear recurrences and complex Gaussian initialization of the weights.

What would settle it

Simulations computing the signal energy variance for widths around 10,000 and depths from 100 to 1,000, checking whether the transition to finite-width dominance occurs near a depth of sqrt(width).
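
A hedged sketch of that check, scaled down from width 10,000 so it runs in seconds. The energy here is ||W^t x||² for a random unit vector x, a cheap stand-in for the paper's trace-based energies, and the 1/n entry variance is an assumption of this sketch.

    import numpy as np

    def complex_gaussian_matrix(n, rng):
        # Entries ~ CN(0, 1/n): E|W_ij|^2 = 1/n, so one step preserves energy in expectation.
        return (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) * np.sqrt(0.5 / n)

    def energy_trajectory(n, depth, rng):
        # ||W^t x||^2 for one draw of W and a random unit vector x, for t = 1..depth.
        W = complex_gaussian_matrix(n, rng)
        x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
        x /= np.linalg.norm(x)
        out = np.empty(depth)
        for t in range(depth):
            x = W @ x
            out[t] = np.linalg.norm(x) ** 2
        return out

    rng = np.random.default_rng(0)
    n, depth, samples = 400, 100, 100          # sqrt(n) = 20
    trajs = np.stack([energy_trajectory(n, depth, rng) for _ in range(samples)])
    spread = trajs.std(axis=0) / trajs.mean(axis=0)
    for t in (5, 10, 20, 40, 80):
        print(f"t={t:3d}  t/sqrt(n)={t / np.sqrt(n):4.1f}  relative spread={spread[t - 1]:.2f}")

The relevant signature would be a relative spread that stays small while t/√n ≪ 1 and becomes order one around t/√n ≈ 1; rerunning at the widths and depths proposed above would probe the same transition at the scale the paper targets.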

Figures

Figures reproduced from arXiv: 2605.05113 by Mariia Seleznova.

Figure 1. Depth–width scaling laws in linear recurrences. The behavior of the signal energies under three joint scaling regimes of recurrent depth and width: (i) the subcritical regime k, t = o(√n); (ii) the critical regime k, t ∼ c√n; and (iii) the supercritical regime √n ≪ k, t ≪ n^{2/3}. Panel (a) shows the LRU signal energy S_{n,t}, and panel (b) shows the RNN signal energy Q_{n,k}, as defined in (4). (Caption truncated in source.)

Figure 2. Numerical validation of the critical scaling theory. Panel (a) shows Monte Carlo estimates of S_{n,t}, and panel (b) shows Monte Carlo estimates of Q_{n,k}, each based on 10^5 samples and plotted against the critical scaling variable c for several widths n. In both panels, the numerical estimates are compared with the corresponding theoretical critical profiles, given by (14) for S_{n,t} and (13) for Q_{n,k}. (Caption truncated in source.)

Figure 3. Real Gaussian weights case. (Caption truncated in source.)
original abstract

We study signal propagation in linear recurrent models at finite width. While existing signal propagation theory relies predominantly on the infinite-width limit, it remains unclear for how long that approximation remains accurate when recurrent depth $t$ grows jointly with width $n$. This question is especially relevant for modern recurrent sequence models, whose natural operating regime involves long input sequences, i.e., large $t$. We derive exact finite-width formulas for the hidden state signal energies in linear recurrences under complex Gaussian initialization. Using these formulas, we identify the joint depth-width scaling regimes that govern signal propagation: (i) a subcritical regime $t=o(\sqrt n)$, in which the infinite-width approximation remains valid; (ii) a critical regime $t\sim c\sqrt n$, in which non-negligible deviations from infinite-width predictions appear and a nontrivial joint scaling limit emerges; and (iii) a supercritical regime $t\gg \sqrt n$, in which finite-width effects dominate. Thus, our results pinpoint the precise recurrent depth scale at which infinite-width theory breaks down in long-range linear recurrences. In turn, this shows when standard initialization schemes, such as Glorot, become unstable. More broadly, our results demonstrate that finite-width effects accumulate more rapidly with depth in recurrent models than in feedforward ones, leading to qualitatively different signal propagation behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper derives exact closed-form expressions for hidden-state signal energies in linear recurrent models under complex Gaussian weight initialization at finite width n and recurrent depth t. From these, it identifies three joint scaling regimes: subcritical (t = o(√n)) where infinite-width predictions remain accurate, critical (t ∼ c√n) featuring a nontrivial joint limit with non-negligible deviations, and supercritical (t ≫ √n) where finite-width effects dominate. The author concludes that this pinpoints the depth scale at which infinite-width theory breaks down in long-range linear recurrences and indicates when standard initializations such as Glorot become unstable, while noting that finite-width effects accumulate faster with depth in recurrent than in feedforward models.

Significance. If the closed-form derivations hold, the work supplies a precise, non-asymptotic characterization of signal propagation that directly yields the three regimes without relying on approximations or fitted parameters. This is a clear strength for theoretical understanding of recurrent models. The observation that recurrent finite-width corrections set in at t ∼ √n, earlier than the depth-of-order-width scale at which such corrections typically appear in feedforward nets, is potentially useful for guiding initialization and depth choices in sequence models. Significance is reduced, however, by the restriction to linear dynamics and complex Gaussians, which leaves open the transfer to the nonlinear, real-valued networks where Glorot is actually applied.

major comments (1)
  1. [Abstract] Abstract (final paragraph) and concluding discussion: the claim that the derived regimes 'show when standard initialization schemes, such as Glorot, become unstable' is unsupported by the analysis. The exact formulas and scaling limits are obtained only for linear recurrences with complex Gaussian initialization; Glorot uses real-valued weights with variance scaling 2/(fan-in + fan-out) and is deployed inside nonlinear activations. No moment calculations for real Gaussians, no extension to nonlinearities, and no empirical verification are provided to justify the transfer of the t ∼ √n breakdown.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and for highlighting the need to align our claims more precisely with the scope of the analysis. We address the major comment below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract] Abstract (final paragraph) and concluding discussion: the claim that the derived regimes 'show when standard initialization schemes, such as Glorot, become unstable' is unsupported by the analysis. The exact formulas and scaling limits are obtained only for linear recurrences with complex Gaussian initialization; Glorot uses real-valued weights with variance scaling 2/(fan-in + fan-out) and is deployed inside nonlinear activations. No moment calculations for real Gaussians, no extension to nonlinearities, and no empirical verification are provided to justify the transfer of the t ∼ √n breakdown.

    Authors: We agree that the specific claim linking our regimes directly to the instability of Glorot initialization is not supported by the current derivations, which are restricted to linear dynamics under complex Gaussian weights. The reference to Glorot was meant to suggest broader relevance for initialization in recurrent models, but we acknowledge that transferring the t ∼ √n scale to real-valued nonlinear networks would require additional analysis of moments under real Gaussians, incorporation of nonlinear activations, and ideally empirical checks. We will revise the abstract and concluding discussion to remove this claim, instead stating that the results characterize the breakdown of infinite-width theory in linear recurrent models and provide a foundation for studying finite-width effects in more general recurrent architectures. A short limitations paragraph will be added to discuss the linear/complex-Gaussian restriction and outline extensions to real weights and nonlinearities as future work. revision: yes
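
If the empirical check deferred in the rebuttal were run, one low-cost version (a sketch under this review's assumptions, not the paper's method) is to compare the spread of the depth-t energy at the claimed critical depth t ≈ √n under complex Gaussian entries CN(0, 1/n) and real Gaussian entries N(0, 1/n); both scalings preserve the energy in expectation at a single step, and the energy definition below is the same vector-based proxy assumed earlier.

    import numpy as np

    def depth_t_energy(n, t, rng, complex_weights):
        # ||W^t x||^2 for a random unit vector x; W has i.i.d. entries with variance 1/n,
        # complex CN(0, 1/n) or real N(0, 1/n) depending on the flag.
        if complex_weights:
            W = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) * np.sqrt(0.5 / n)
            x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
        else:
            W = rng.standard_normal((n, n)) * np.sqrt(1.0 / n)
            x = rng.standard_normal(n)
        x = x / np.linalg.norm(x)
        for _ in range(t):
            x = W @ x
        return np.linalg.norm(x) ** 2

    rng = np.random.default_rng(1)
    n, t, samples = 400, 20, 200               # t = sqrt(n), the claimed critical scale
    for label, is_complex in (("complex", True), ("real", False)):
        e = np.array([depth_t_energy(n, t, rng, is_complex) for _ in range(samples)])
        print(f"{label:7s}  mean={e.mean():.2f}  relative spread={e.std() / e.mean():.2f}")

A materially larger or differently scaling spread in the real case would indicate that the critical exponent does shift, as the editorial extension above speculates.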

Circularity Check

0 steps flagged

No circularity: derivation follows from direct expectation calculations under stated assumptions

full rationale

The paper computes exact finite-width expressions for hidden-state signal energies by taking expectations over complex Gaussian weights in linear recurrences. Scaling regimes (subcritical t = o(√n), critical t ∼ c√n, supercritical t ≫ √n) are obtained by asymptotic analysis of those closed-form expressions. No fitted parameters are relabeled as predictions, no self-citations supply load-bearing uniqueness theorems, and no ansatz is smuggled in; the claimed breakdown scale is a direct consequence of the derived formulas rather than a redefinition of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the exact solvability of moments under complex Gaussian initialization and the linearity of the recurrence; these are standard domain assumptions rather than ad-hoc inventions or fitted parameters.

axioms (2)
  • domain assumption Weights are initialized as complex Gaussian random variables
    Invoked to obtain closed-form expressions for signal energies.
  • domain assumption The recurrence is strictly linear
    Allows exact propagation of second moments without nonlinear complications (see the one-step check after this list).
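
A one-step check of how these two assumptions work together, sketched under an assumed entry variance of 1/n (not stated in the ledger): for a fixed vector h and i.i.d. complex Gaussian entries with E[W_{ij} \overline{W}_{ik}] = δ_{jk}/n,

    \mathbb{E}\,\lVert W h\rVert^2
      = \sum_{i,j,k} \mathbb{E}\big[W_{ij}\,\overline{W}_{ik}\big]\, h_j \overline{h}_k
      = \sum_{i,j} \tfrac{1}{n}\,\lvert h_j\rvert^2
      = \lVert h\rVert^2 .

Linearity is what lets this expectation be taken with no activation in the way. The nontrivial part of the paper is that the hidden state at step t-1 is itself built from the same W, so the moments of the depth-t energy do not factorize step by step; the exact finite-width formulas exist to track those reuse correlations.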

pith-pipeline@v0.9.0 · 5536 in / 1354 out tokens · 50495 ms · 2026-05-08T17:48:26.001787+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

44 extracted references · 13 canonical work pages · 4 internal anchors

  1. [1]

    The recurrent neural tangent kernel

    Sina Alemohammad, Zichao Wang, Randall Balestriero, and Richard Baraniuk. The recurrent neural tangent kernel. InThe International Conference on Learning Representations, 2021

  2. [2]

    Hultman numbers, polygon gluings and matrix integrals

    Nikita Alexeev and Peter Zograf. Hultman numbers, polygon gluings and matrix integrals. arXiv preprint arXiv:1111.3061, 2011

  3. [3]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

  4. [4]

    Revisiting glorot initialization for long-range linear recurrences

    Noga Bar, Mariia Seleznova, Yotam Alexander, Gitta Kutyniok, and Raja Giryes. Revisiting glorot initialization for long-range linear recurrences. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  5. [5]

    Learning long-term dependencies with gradient descent is difficult

    Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994

  6. [6]

    Dynamical isometry and a mean field theory of rnns: Gating enables signal propagation in recurrent neural networks

    Minmin Chen, Jeffrey Pennington, and Samuel Schoenholz. Dynamical isometry and a mean field theory of rnns: Gating enables signal propagation in recurrent neural networks. In International Conference on Machine Learning, pages 873–882. PMLR, 2018

  7. [7]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014

  8. [8]

    Griffin: Mixing gated linear recurrences with local attention for efficient language models,

    Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427, 2024

  9. [9]

    On the number of cycles in commutators of random permutations

    Guillaume Dubach. On the number of cycles in commutators of random permutations. The Annals of Applied Probability, 34(4):4072–4084, 2024

  10. [10]

    Hungry hungry hippos: Towards language modeling with state space models

    Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. Hungry hungry hippos: Towards language modeling with state space models. InThe Eleventh International Conference on Learning Representations, 2023

  11. [11]

    Lstm recurrent networks learn simple context-free and context-sensitive languages

    Felix A Gers and E Schmidhuber. Lstm recurrent networks learn simple context-free and context-sensitive languages. IEEE transactions on neural networks, 12(6):1333–1340, 2001

  12. [12]

    Dynamical isometry and a mean field theory of lstms and grus

    Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S Schoenholz, Ed H Chi, and Jeffrey Pennington. Dynamical isometry and a mean field theory of lstms and grus. arXiv preprint arXiv:1901.08987, 2019

  13. [13]

    Understanding the difficulty of training deep feedforward neural networks

    Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010

  14. [14]

    The edge of chaos: quantum field theory and deep neural networks

    Kevin Grosvenor and Ro Jefferson. The edge of chaos: quantum field theory and deep neural networks. SciPost Physics, 12(3):081, 2022

  15. [15]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  16. [16]

    Hippo: Recurrent memory with optimal polynomial projections

    Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33:1474–1487, 2020

  17. [17]

    On the parameterization and initialization of diagonal state space models

    Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022

  18. [18]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021

  19. [19]

    Which neural net architectures give rise to exploding and vanishing gradients?

    Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? Advances in neural information processing systems, 31, 2018

  20. [20]

    Random fully connected neural networks as perturbatively solvable hierarchies

    Boris Hanin. Random fully connected neural networks as perturbatively solvable hierarchies. Journal of Machine Learning Research, 25(267):1–58, 2024

  21. [21]

    Finite depth and width corrections to the neural tangent kernel

    Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. arXiv preprint arXiv:1909.05989, 2019

  22. [22]

    Products of many large random matrices and gradients in deep neural networks

    Boris Hanin and Mihai Nica. Products of many large random matrices and gradients in deep neural networks. Communications in Mathematical Physics, 376(1):287–322, 2020

  23. [23]

    Delving deep into rectifiers: Surpassing human-level performance on imagenet classification

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. InProceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015

  24. [24]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  25. [25]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

  26. [26]

    On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables

    Leon Isserlis. On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables. Biometrika, 12(1/2):134–139, 1918

  27. [27]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018

  28. [28]

    Gaussian process behaviour in wide deep neural networks

    Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018

  29. [29]

    Resurrecting recurrent neural networks for long sequences

    Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pages 26670–26698. PMLR, 2023

  30. [30]

    On the difficulty of training recurrent neural networks

    Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. InInternational conference on machine learning, pages 1310–1318. Pmlr, 2013

  31. [31]

    Exponential expressivity in deep neural networks through transient chaos

    Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. Advances in neural information processing systems, 29, 2016

  32. [32]

    The principles of deep learning theory

    Daniel A Roberts, Sho Yaida, and Boris Hanin. The principles of deep learning theory, volume 46. Cambridge University Press, Cambridge, MA, USA, 2022

  33. [33]

    Deep information propagation

    Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016

  34. [34]

    Unified field theoretical approach to deep and recurrent neuronal networks

    Kai Segadlo, Bastian Epping, Alexander Van Meegen, David Dahmen, Michael Krämer, and Moritz Helias. Unified field theoretical approach to deep and recurrent neuronal networks. Journal of Statistical Mechanics: Theory and Experiment, 2022(10):103401, 2022

  35. [35]

    Neural tangent kernel beyond the infinite-width limit: Effects of depth and initialization

    Mariia Seleznova and Gitta Kutyniok. Neural tangent kernel beyond the infinite-width limit: Effects of depth and initialization. InInternational Conference on Machine Learning, pages 19522–19560. PMLR, 2022

  36. [36]

    Two enumerative results on cycles of permutations

    Richard P Stanley. Two enumerative results on cycles of permutations. European Journal of Combinatorics, 32(6):937–943, 2011

  37. [37]

    Dynamical isometry is achieved in residual networks in a universal way for any activation function

    Wojciech Tarnowski, Piotr Warchol, Stanislaw Jastrzebski, Jacek Tabor, and Maciej Nowak. Dynamical isometry is achieved in residual networks in a universal way for any activation function. InThe 22nd International Conference on Artificial Intelligence and Statistics, pages 2221–2230. PMLR, 2019

  38. [38]

    The evaluation of the collision matrix

    Gian-Carlo Wick. The evaluation of the collision matrix. Physical review, 80(2):268, 1950

  39. [39]

    Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks

    Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, pages 5393–5402. PMLR, 2018

  40. [40]

    Non-gaussian processes and neural networks at finite widths

    Sho Yaida. Non-gaussian processes and neural networks at finite widths. InMathematical and Scientific Machine Learning, pages 165–192. PMLR, 2020

  41. [41]

    Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation

    Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019

  42. [42]

    Tensor programs ii: Neural tangent kernel for any architecture

    Greg Yang. Tensor programs ii: Neural tangent kernel for any architecture. arXiv preprint arXiv:2006.14548, 2020

  43. [43]

    Tensor programs iii: Neural matrix laws

    Greg Yang. Tensor programs iii: Neural matrix laws. arXiv preprint arXiv:2009.10685, 2020

  44. [44]

    Tensor programs iv: Feature learning in infinite-width neural networks

    Greg Yang and Edward J Hu. Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pages 11727–11737. PMLR, 2021