The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?
Pith reviewed 2026-05-21 06:43 UTC · model grok-4.3
The pith
GLU structures reshape the NTK spectrum to lower its condition number and speed up convergence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. This spectral property improves training dynamics, causing GLU models to converge faster than non-GLU models and producing a characteristic loss-crossing phenomenon. On models such as ViT and GPT-2, GLU shows limited effect on reducing the generalization gap, so its main value is accelerating optimization.
What carries the argument
The neural tangent kernel (NTK) spectrum of the model: the GLU structure lowers its condition number and compresses its eigenvalue distribution, which lets gradient descent make faster progress.
Load-bearing premise
The spectral effects observed in two-layer NTK analysis carry over to explain GLU advantages in deep nonlinear networks used in practice.
What would settle it
A direct computation showing that the NTK condition number remains unchanged or increases with GLU in a deep network would challenge the link between the two-layer analysis and real LLM performance.
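The kind of direct computation meant here can be sketched for the two-layer, finite-width case with explicit Jacobians in NumPy. This is a minimal sketch, not the paper's code: the ReGLU form z(x) = V·(relu(Wx) ⊙ (Ux)), the Gaussian inputs, the initialization scales, and all function names are assumptions made for illustration. A deep-network version would need Jacobians from an autodiff framework instead.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def ntk_relu(X, W, V):
    """Empirical NTK of z(x) = V . relu(W x), built from per-sample gradients."""
    pre = X @ W.T                                   # (n, m) pre-activations
    gate = (pre > 0).astype(float)                  # relu'
    dV = relu(pre)                                  # dz/dV_k
    dW = (gate * V)[:, :, None] * X[:, None, :]     # dz/dW_ks, shape (n, m, d)
    J = np.concatenate([dV, dW.reshape(len(X), -1)], axis=1)
    return J @ J.T

def ntk_reglu(X, W, U, V):
    """Empirical NTK of the ASSUMED ReGLU form z(x) = V . (relu(W x) * (U x))."""
    pre, lin = X @ W.T, X @ U.T
    act, gate = relu(pre), (pre > 0).astype(float)
    n = len(X)
    dV = act * lin
    dW = (gate * lin * V)[:, :, None] * X[:, None, :]
    dU = (act * V)[:, :, None] * X[:, None, :]
    J = np.concatenate([dV, dW.reshape(n, -1), dU.reshape(n, -1)], axis=1)
    return J @ J.T

def condition_number(K):
    eig = np.linalg.eigvalsh(K)     # eigenvalues in ascending order
    return eig[-1] / eig[0]

rng = np.random.default_rng(0)
n, d, m = 40, 10, 256               # illustrative sizes only
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d)) / np.sqrt(d)
U = rng.standard_normal((m, d)) / np.sqrt(d)
V = rng.standard_normal(m) / np.sqrt(m)

K_relu = ntk_relu(X, W, V)
K_reglu = ntk_reglu(X, W, U, V)
print("cond(ReLU NTK): ", condition_number(K_relu))
print("cond(ReGLU NTK):", condition_number(K_reglu))
```

Repeating the comparison across seeds, widths, and depths would show whether the ordering of the two condition numbers is stable. A single draw like this one settles nothing by itself.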
Original abstract
Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU has limited impact in reducing the generalization gap on various models, including ViT and GPT-2, suggesting that its primary benefit lies in accelerating optimization rather than reducing the generalization gap.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes Gated Linear Units (GLU) in two-layer networks under the neural tangent kernel (NTK) regime, claiming that GLU reshapes the NTK spectrum to yield a smaller condition number and more compact eigenvalue distribution. This spectral change is then linked to faster convergence and a loss-crossing phenomenon in training dynamics. Empirical results on ViT and GPT-2 are presented to show that GLU has limited impact on reducing the generalization gap, suggesting its primary benefit is optimization speed rather than generalization.
Significance. If the NTK spectral reshaping and resulting condition-number reduction can be shown to persist or dominate in deeper, non-linear architectures with residuals, the work would offer a concrete theoretical account for the empirical superiority of GLU variants in modern LLMs, shifting focus from generalization to optimization dynamics. The reported generalization-gap experiments on ViT and GPT-2 provide useful negative evidence that strengthens the optimization-focused interpretation.
major comments (2)
- [Abstract and NTK analysis sections] The central derivation of spectrum reshaping, smaller condition number, and faster convergence is performed only for two-layer networks in the NTK regime (abstract and the analysis sections). The manuscript does not derive, bound, or empirically compute the NTK spectrum (or effective condition number) for any architecture deeper than two layers, leaving the extrapolation to deep LLMs (including the GPT-2 experiments) unverified and load-bearing for the stated explanation of GLU's advantage.
- [Training dynamics analysis] The training-dynamics claims, including the loss-crossing phenomenon, rest on the two-layer NTK spectrum result. Without additional analysis showing that the same spectral properties control convergence once depth, layer-wise non-linearities, and residual connections are introduced, the link between the derived condition-number improvement and observed behavior in modern models remains an assumption rather than a demonstrated mechanism.
minor comments (2)
- [Empirical section] Dataset details, error bars, and hyper-parameter choices for the ViT and GPT-2 generalization-gap experiments should be expanded to allow readers to assess whether post-hoc fitting affects the reported limited impact on generalization.
- [NTK derivation] Notation for the NTK eigenvalues and condition number should be made fully explicit (including any dependence on the gating parameters) to clarify that the reported improvement is not an artifact of parameter choices.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting the importance of verifying the NTK spectral properties in deeper architectures. We agree that our theoretical results are derived for two-layer networks and that extending them rigorously to deep models with residuals is a valuable direction for future work. Below, we address the major comments point by point.
Point-by-point responses
-
Referee: [Abstract and NTK analysis sections] The central derivation of spectrum reshaping, smaller condition number, and faster convergence is performed only for two-layer networks in the NTK regime (abstract and the analysis sections). The manuscript does not derive, bound, or empirically compute the NTK spectrum (or effective condition number) for any architecture deeper than two layers, leaving the extrapolation to deep LLMs (including the GPT-2 experiments) unverified and load-bearing for the stated explanation of GLU's advantage.
Authors: We acknowledge this limitation. Our analysis focuses on two-layer networks in the NTK regime to obtain closed-form insights into how GLU reshapes the spectrum and reduces the condition number. Extending this derivation to deeper networks is technically challenging due to the complexity of the NTK in the presence of residuals and multiple layers, and we do not claim to have done so. Instead, the two-layer case serves as a theoretical foundation, and we support the extrapolation with empirical evidence from training ViT and GPT-2 models, where GLU accelerates optimization without major generalization improvements. We will revise the manuscript to more explicitly state this scope and add a paragraph discussing the assumptions involved in applying the insights to modern LLMs. revision: partial
-
Referee: [Training dynamics analysis] The training-dynamics claims, including the loss-crossing phenomenon, rest on the two-layer NTK spectrum result. Without additional analysis showing that the same spectral properties control convergence once depth, layer-wise non-linearities, and residual connections are introduced, the link between the derived condition-number improvement and observed behavior in modern models remains an assumption rather than a demonstrated mechanism.
Authors: The loss-crossing is an empirical observation in our experiments on deeper models. The NTK analysis in two layers explains a plausible mechanism via improved conditioning leading to faster convergence. While we agree that demonstrating the same spectral control in deep residual networks would provide stronger evidence, our current contribution is to identify this mechanism in a tractable setting and show consistency with practice. We will update the discussion to clarify that the link is based on the simplified model and empirical corroboration, rather than a complete proof for all architectures. revision: partial
- Deriving or bounding the NTK spectrum for deep residual networks with GLU and non-linear activations.
Circularity Check
NTK spectrum analysis for two-layer networks is derived independently without reducing to fitted inputs or self-citations
Full rationale
The paper's core derivation analyzes the NTK for two-layer networks under the GLU activation to obtain the eigenvalue spectrum, condition number, and resulting convergence rates directly from the kernel expressions and activation properties. These steps are mathematical derivations rather than fits to the target advantage or self-referential definitions. No load-bearing self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear in the described chain; the generalization-gap experiments are presented as separate empirical observations. The analysis is self-contained within its stated two-layer NTK regime and does not reduce the claimed spectrum reshaping to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the NTK regime holds for the two-layer networks under consideration
Reference graph
Works this paper leans on
-
[1]
PMLR, 2017. de Ryck, T., Bonnet, F., Mishra, S., and de Bézenac, E. An operator preconditioning perspective on training in physics-informed machine learning. In International Conference on Learning Representations, 2024. Dey, R. and Salem, F. M. Gate-variants of gated recurrent unit (GRU) neural networks. In 2017 IEEE 60th international midwest symp...
-
[2]
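For the ReLU model, the $ij$-th element of its NTK matrix is approximately $K = \alpha XX^\top + \beta rr^\top + \gamma D$. (10) Here $\alpha = \frac{1}{4} + \frac{m}{4d}$, $\beta = \frac{m}{2\pi d}$, $\gamma = \frac{1}{4} + \frac{m}{4d} - \frac{m}{2\pi d}$.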
-
[3]
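For the ReGLU model, the $ij$-th element of its NTK matrix is approximately $\tilde K = \tilde\alpha (XX^\top) \odot (XX^\top) + \tilde\beta (rr^\top) \odot (XX^\top) + \tilde\gamma D^2$. (11) Here $\tilde\alpha = \frac{m}{4d^2} + \frac{1}{2d}$, $\tilde\beta = \frac{1}{2\pi d} + \frac{m}{2\pi d^2}$, $\tilde\gamma = \frac{1}{2d} - \frac{1}{2\pi d} + \frac{m}{4d^2} - \frac{m}{2\pi d^2}$. Proof. The proof is carried out in two main steps. First, we derive a general expression for the NTK matrix that is independent of the specific activation functi...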
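The two closed-form approximations (10) and (11) can be assembled directly and their condition numbers compared. The following is a sketch under stated assumptions: the excerpt does not define r or D, so r is taken to be the vector of input norms and D the diagonal matrix of squared norms (this choice reproduces the diagonal entries given in the entrywise forms (14) and (15)). The Gaussian data and the function names are likewise illustrative.

```python
import numpy as np

def relu_ntk_approx(X, m):
    """K = alpha XX^T + beta rr^T + gamma D, following Eq. (10)."""
    n, d = X.shape
    alpha = 0.25 + m / (4 * d)
    beta = m / (2 * np.pi * d)
    gamma = 0.25 + m / (4 * d) - m / (2 * np.pi * d)
    G = X @ X.T
    r = np.linalg.norm(X, axis=1)       # ASSUMPTION: r_i = ||x_i||
    D = np.diag(r ** 2)                 # ASSUMPTION: D = diag(||x_i||^2)
    return alpha * G + beta * np.outer(r, r) + gamma * D

def reglu_ntk_approx(X, m):
    """K~ = a (XX^T o XX^T) + b (rr^T o XX^T) + g D^2, following Eq. (11)."""
    n, d = X.shape
    a = m / (4 * d**2) + 1 / (2 * d)
    b = 1 / (2 * np.pi * d) + m / (2 * np.pi * d**2)
    g = 1 / (2 * d) - 1 / (2 * np.pi * d) + m / (4 * d**2) - m / (2 * np.pi * d**2)
    G = X @ X.T
    r = np.linalg.norm(X, axis=1)
    D2 = np.diag(r ** 4)
    return a * G * G + b * np.outer(r, r) * G + g * D2   # '*' is the Hadamard product

def cond(K):
    eig = np.linalg.eigvalsh(K)         # ascending eigenvalues
    return eig[-1] / eig[0]

rng = np.random.default_rng(0)
n, d, m = 200, 50, 1000                 # illustrative sizes with n > d
X = rng.standard_normal((n, d))         # ASSUMPTION: standard Gaussian inputs
K = relu_ntk_approx(X, m)
K_t = reglu_ntk_approx(X, m)
print("cond(K)  =", cond(K))
print("cond(K~) =", cond(K_t))
```

The sizes here are arbitrary; the comparison is only meaningful in the regime the paper's approximations target, so other choices of n, d, and m may behave differently.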
-
[4]
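Consider the ReLU model first: $z(x) = V\phi(Wx)$
Obtaining the general form of the NTK matrix. Consider the ReLU model first: $z(x) = V\phi(Wx)$. Taking derivatives w.r.t. all parameters, we get: $\frac{\partial z(x)}{\partial V_k} = \phi(W_k^\top x)$, $\frac{\partial z(x)}{\partial W_{ks}} = V_k \phi'(W_k^\top x) x_s$, where $W_k$ stands for the $k$-th row of $W$. Hence we have: $\langle \nabla_V z(x_i), \nabla_V z(x_j) \rangle = \sum_{k=1}^{m} \phi(W_k^\top x_i)\phi(W_k^\top x_j)$ and $\langle \nabla_W z(x_i), \nabla_W z(x_j) \rangle = (x_i^\top x_j) \sum_{k=1}^{m} V_k^2 \phi'(W_k^\top x_i)\phi'(W_k^\top x_j)$. Takin...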
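The gradient inner products above can be checked numerically for a small ReLU network. This is an illustrative sketch: the sizes, seed, and function names are assumptions, and the NTK entry is simply the sum of the V-part and W-part inner products from the formulas.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def z(x, W, V):
    """Two-layer ReLU model z(x) = V . relu(W x)."""
    return V @ relu(W @ x)

def grads(x, W, V):
    """Analytic gradients: dz/dV_k = phi(W_k.x), dz/dW_ks = V_k phi'(W_k.x) x_s."""
    pre = W @ x
    dV = relu(pre)
    dW = (V * (pre > 0))[:, None] * x[None, :]
    return dV, dW

def ntk_entry(xi, xj, W, V):
    """Closed-form NTK entry from the two inner-product formulas."""
    ai, aj = W @ xi, W @ xj
    v_term = relu(ai) @ relu(aj)
    w_term = (xi @ xj) * np.sum(V**2 * (ai > 0) * (aj > 0))
    return v_term + w_term

rng = np.random.default_rng(0)
m, d = 8, 5                                  # small illustrative sizes
W = rng.standard_normal((m, d))
V = rng.standard_normal(m)
xi, xj = rng.standard_normal(d), rng.standard_normal(d)

dVi, dWi = grads(xi, W, V)
dVj, dWj = grads(xj, W, V)
from_grads = dVi @ dVj + np.sum(dWi * dWj)   # <grad z(xi), grad z(xj)>
print(from_grads, ntk_entry(xi, xj, W, V))   # the two values should agree
```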
-
[5]
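Using the arc-cosine kernel and Taylor approximation. Since we are considering ReLU-activated models, we can use the arc-cosine kernel to get rid of the expectation factors. Specifically, we have (Cho & Saul, 2010), $\mathbb{E}_w[\phi(w^\top x_i)\phi(w^\top x_j)] = \frac{\sigma_w^2 \|x_i\| \|x_j\|}{2\pi} \left( \sqrt{1 - \rho_{ij}^2} + (\pi - \arccos \rho_{ij}) \rho_{ij} \right)$, $\mathbb{E}_w[\phi'(w^\top x_i)\phi'(w^\top x_j)] = \frac{1}{2\pi} (\pi - \arccos \rho_{ij})$. Here $\rho_{ij} := \frac{x_i^\top x_j}{\|x_i\| \|x_j\|}$ is the ...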
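Both arc-cosine identities are easy to sanity-check by Monte Carlo. The sketch below assumes w is drawn from a standard Gaussian (so sigma_w = 1); the sample size, dimension, and function names are arbitrary choices for illustration.

```python
import numpy as np

def cos_angle(xi, xj):
    rho = xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))
    return np.clip(rho, -1.0, 1.0)           # guard against rounding outside [-1, 1]

def arccos_k1(xi, xj, sigma_w=1.0):
    """Closed form of E_w[relu(w.xi) relu(w.xj)] for w ~ N(0, sigma_w^2 I)."""
    rho = cos_angle(xi, xj)
    scale = sigma_w**2 * np.linalg.norm(xi) * np.linalg.norm(xj) / (2 * np.pi)
    return scale * (np.sqrt(1 - rho**2) + (np.pi - np.arccos(rho)) * rho)

def arccos_k0(xi, xj):
    """Closed form of E_w[relu'(w.xi) relu'(w.xj)]."""
    return (np.pi - np.arccos(cos_angle(xi, xj))) / (2 * np.pi)

rng = np.random.default_rng(0)
d = 6
xi, xj = rng.standard_normal(d), rng.standard_normal(d)
w = rng.standard_normal((400_000, d))        # Monte Carlo samples with sigma_w = 1
a, b = w @ xi, w @ xj

mc_k1 = np.mean(np.maximum(a, 0) * np.maximum(b, 0))
mc_k0 = np.mean((a > 0) & (b > 0))
print("order 1:", mc_k1, "vs", arccos_k1(xi, xj))
print("order 0:", mc_k0, "vs", arccos_k0(xi, xj))
```

As a quick consistency check, setting the two inputs equal gives $\rho = 1$, so the first expression reduces to $\sigma_w^2 \|x\|^2 / 2$ and the second to $1/2$.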
-
[6]
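For the ReLU model, the $ij$-th element of its NTK matrix is approximately $K_{ij} = \left( \frac{m}{2d} + \frac{1}{2} \right) \|x_i\|^2$ for $i = j$, and $K_{ij} = \left[ \frac{m}{2\pi d} + \left( \frac{1}{4} + \frac{m}{4d} \right) \rho_{ij} + \left( \frac{1}{2\pi} + \frac{m}{4\pi d} \right) \rho_{ij}^2 \right] \|x_i\| \|x_j\|$ for $i \neq j$. (14)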
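The coefficients in (14) can be recovered from the arc-cosine expressions by a second-order Taylor expansion around $\rho = 0$. The worked step below is a reconstruction, not a quote from the paper: it assumes $\sigma_w^2 = 1/d$ and $\sum_k V_k^2 \approx 1$, scalings that are inferred from the coefficients rather than stated in the excerpt.

```latex
% Second-order expansions around \rho = 0:
%   \sqrt{1-\rho^2} \approx 1 - \tfrac{\rho^2}{2}, \qquad \arccos\rho \approx \tfrac{\pi}{2} - \rho .
\begin{align*}
\mathbb{E}_w[\phi(w^\top x_i)\phi(w^\top x_j)]
  &\approx \sigma_w^2 \|x_i\|\|x_j\|
     \left( \frac{1}{2\pi} + \frac{\rho_{ij}}{4} + \frac{\rho_{ij}^2}{4\pi} \right), \\
\mathbb{E}_w[\phi'(w^\top x_i)\phi'(w^\top x_j)]
  &\approx \frac{1}{4} + \frac{\rho_{ij}}{2\pi}.
\end{align*}
% Assumed scalings: m \sigma_w^2 = m/d and \sum_k V_k^2 \approx 1; also x_i^\top x_j = \rho_{ij}\|x_i\|\|x_j\|.
\begin{align*}
K_{ij}
  &\approx \frac{m}{d}\left( \frac{1}{2\pi} + \frac{\rho_{ij}}{4} + \frac{\rho_{ij}^2}{4\pi} \right)\|x_i\|\|x_j\|
     + \rho_{ij}\left( \frac{1}{4} + \frac{\rho_{ij}}{2\pi} \right)\|x_i\|\|x_j\| \\
  &= \left[ \frac{m}{2\pi d}
     + \left( \frac{1}{4} + \frac{m}{4d} \right)\rho_{ij}
     + \left( \frac{1}{2\pi} + \frac{m}{4\pi d} \right)\rho_{ij}^2 \right]\|x_i\|\|x_j\| .
\end{align*}
% Diagonal (\rho_{ii} = 1, no expansion needed):
%   K_{ii} = \frac{m}{d}\cdot\frac{\|x_i\|^2}{2} + \frac{\|x_i\|^2}{2}
%          = \left( \frac{m}{2d} + \frac{1}{2} \right)\|x_i\|^2 .
```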
-
[7]
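(15) Finally, for both models, we only keep the terms where $\rho_{ij}$ can be absorbed into $x_i^\top x_j$
For the ReGLU model, the $ij$-th element of its NTK matrix is approximately $\tilde K_{ij} = \left( \frac{m}{2d^2} + \frac{1}{d} \right) \|x_i\|^4$ for $i = j$, and $\tilde K_{ij} = \left[ \left( \frac{1}{2\pi d} + \frac{m}{2\pi d^2} \right) \rho_{ij} + \left( \frac{1}{2d} + \frac{m}{4d^2} \right) \rho_{ij}^2 \right] \|x_i\|^2 \|x_j\|^2$ for $i \neq j$. (15) Finally, for both models, we only keep the terms where $\rho_{ij}$ can be absorbed into $x_i^\top x_j$. That is, we drop the $\rho_{ij}^2$ term in the ReLU model but keep it in the ReGLU model. Then we obtain the fi...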
-
[8]
For the ReLU model, the largest eigenvalue of its NTK matrix is given by $\lambda_1(K) \approx \frac{m}{2\pi} \cdot n + \frac{d}{2} + \frac{(\pi - 1) m}{2\pi}$
-
[9]
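Therefore, $\lambda_1(K) = \Theta(mn)$, $\lambda_1(\tilde K) = \Theta(mn/d)$
For the ReGLU model, the largest eigenvalue of its NTK matrix is given by $\left( \frac{m}{4d} + \frac{1}{2} \right) n + \frac{m}{2} - \frac{m}{2\pi} + d - \frac{d}{2\pi} \lesssim \lambda_1(\tilde K) \lesssim \left( \frac{m}{4d} + \frac{m}{2\pi d} + \frac{1}{2} + \frac{1}{2\pi} \right) n + \frac{m}{2} + d$. Therefore, $\lambda_1(K) = \Theta(mn)$, $\lambda_1(\tilde K) = \Theta(mn/d)$. Proof. 1) ReLU model. For the ReLU model, we note that by the law of large numbers, the expression can be approximately written as: $K \approx \alpha XX^\top + \beta d \mathbf{1}\mathbf{1}^\top + \gamma d I$. This form of e...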
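The ReLU largest-eigenvalue formula can be compared against the law-of-large-numbers form $K \approx \alpha XX^\top + \beta d \mathbf{1}\mathbf{1}^\top + \gamma d I$ on random data. This is a hedged sketch: the standard Gaussian inputs, the sizes, and the helper names are assumptions, and $\alpha, \beta, \gamma$ are taken from Eq. (10).

```python
import numpy as np

def relu_ntk_lln(X, m):
    """Law-of-large-numbers form K ~ alpha XX^T + beta d 11^T + gamma d I."""
    n, d = X.shape
    alpha = 0.25 + m / (4 * d)
    beta = m / (2 * np.pi * d)
    gamma = 0.25 + m / (4 * d) - m / (2 * np.pi * d)
    return alpha * (X @ X.T) + beta * d * np.ones((n, n)) + gamma * d * np.eye(n)

def lambda1_formula(n, d, m):
    """Stated approximation: m n / (2 pi) + d / 2 + (pi - 1) m / (2 pi)."""
    return m * n / (2 * np.pi) + d / 2 + (np.pi - 1) * m / (2 * np.pi)

rng = np.random.default_rng(0)
n, d, m = 200, 50, 1000                  # illustrative sizes
X = rng.standard_normal((n, d))          # ASSUMPTION: standard Gaussian inputs
K = relu_ntk_lln(X, m)

lam1 = np.linalg.eigvalsh(K)[-1]         # eigvalsh returns ascending eigenvalues
print("numerical lambda_1:", lam1)
print("formula           :", lambda1_formula(n, d, m))
```

The formula coincides, in expectation, with the Rayleigh quotient of this matrix at the normalized all-ones vector, so the numerical value is expected to sit slightly above it.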
-
[10]
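For the ReGLU model, similarly, we have: $\tilde K \approx \tilde\alpha (XX^\top) \odot (XX^\top) + \tilde\beta d XX^\top + \tilde\gamma d^2 I$
ReGLU model. For the ReGLU model, similarly, we have: $\tilde K \approx \tilde\alpha (XX^\top) \odot (XX^\top) + \tilde\beta d XX^\top + \tilde\gamma d^2 I$. We know that $XX^\top$ is positive semi-definite. Therefore, by the Schur product theorem (Horn & Johnson, 2012, Theorem 7.5.3), $(XX^\top) \odot (XX^\top)$ is also positive semi-definite. Hence, by Weyl's inequality (Thm. B.3), we have: $\max \{ \lambda_1(\tilde\alpha (XX^\top) \odot (XX^\top)), \lambda_1(\tilde\beta d XX^\top) \} \le \lambda_1(\tilde K) - \tilde\gamma d^2 \le$ λ1(˜...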
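The two linear-algebra facts used in this step (the Schur product theorem and the Weyl-type sandwich for a sum of two positive semi-definite matrices) can be confirmed numerically. The matrices and weights below are arbitrary stand-ins chosen for illustration, not the paper's coefficients.

```python
import numpy as np

def top_eig(M):
    """Largest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(M)[-1]

rng = np.random.default_rng(0)
n, d = 30, 8
X = rng.standard_normal((n, d))
G = X @ X.T                      # Gram matrix, positive semi-definite
H = G * G                        # Hadamard square, PSD by the Schur product theorem

A = 0.3 * H                      # stand-in for the alpha-tilde term
B = 0.7 * G                      # stand-in for the beta-tilde term

lower = max(top_eig(A), top_eig(B))
upper = top_eig(A) + top_eig(B)
print("min eig of Hadamard square:", np.linalg.eigvalsh(H)[0])
print(lower, "<=", top_eig(A + B), "<=", upper)
```

The lower bound holds because adding a positive semi-definite matrix cannot decrease the largest eigenvalue, and the upper bound is the usual subadditivity of the largest eigenvalue.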
-
[11]
For the ReLU model, the smallest eigenvalue of its NTK matrix is given by $\lambda_n(K) \approx \frac{m + d}{4} (s^2 + 1) - \frac{m}{2\pi}$, where $s := \max \{ 0, 1 - \sqrt{n/d} \}$
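In the regime $n > d$ we have $s = 0$, and the formula reduces to $\frac{m+d}{4} - \frac{m}{2\pi}$. A minimal numerical check follows, assuming the law-of-large-numbers form $K \approx \alpha XX^\top + \beta d \mathbf{1}\mathbf{1}^\top + \gamma d I$ with the coefficients of Eq. (10) and standard Gaussian inputs; the function names are hypothetical.

```python
import numpy as np

def relu_ntk_lln(X, m):
    """ASSUMED form K ~ alpha XX^T + beta d 11^T + gamma d I."""
    n, d = X.shape
    alpha = 0.25 + m / (4 * d)
    beta = m / (2 * np.pi * d)
    gamma = 0.25 + m / (4 * d) - m / (2 * np.pi * d)
    return alpha * (X @ X.T) + beta * d * np.ones((n, n)) + gamma * d * np.eye(n)

def lambda_n_formula(n, d, m):
    """Stated approximation with s = max(0, 1 - sqrt(n / d))."""
    s = max(0.0, 1.0 - np.sqrt(n / d))
    return (m + d) / 4 * (s**2 + 1) - m / (2 * np.pi)

rng = np.random.default_rng(0)
n, d, m = 200, 50, 1000              # n > d, so s = 0
X = rng.standard_normal((n, d))
lam_min = np.linalg.eigvalsh(relu_ntk_lln(X, m))[0]
print("numerical lambda_n:", lam_min)
print("formula           :", lambda_n_formula(n, d, m))
```

When $n > d + 1$, the first two terms of this form have rank at most $d + 1$, so the smallest eigenvalue equals $\gamma d$, which is the formula's value at $s = 0$. The $n < d$ regime relies on a random-matrix estimate of $\lambda_n(XX^\top)$ and would only match approximately.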
-
[12]
Therefore, both $\lambda_n(K)$ and $\lambda_n(\tilde K)$ are of order $\Theta(m + d)$
For the ReGLU model, the smallest eigenvalue of its NTK matrix is given by $\lambda_n(\tilde K) \approx \frac{m + 2d}{4} (\tilde s^2 + 1) + \frac{m + d}{2\pi} (s^2 - 1)$, where $\tilde s = \max \{ 0, 1 - \sqrt{2n}/d \}$. Therefore, both $\lambda_n(K)$ and $\lambda_n(\tilde K)$ are of order $\Theta(m + d)$. When $n > d$, we have that $\lambda_n(\tilde K) > \lambda_n(K)$. Proof. 1) ReLU model. For the ReLU model, we recall that its NTK matrix can be written...
-
[13]
ReGLU model. For the ReGLU model, recall that $\tilde K = \tilde\alpha (XX^\top) \odot (XX^\top) + \tilde\beta (rr^\top) \odot (XX^\top) + \tilde\gamma D^2$. (17) First, we notice that $XX^\top$ is positive semidefinite. To see this, consider the Rayleigh quotient: $\lambda_n(XX^\top) = \min_{\|v\| = 1} v^\top XX^\top v = \min_{\|v\| = 1} \|X^\top v\|^2 \ge 0$. Similarly, $rr^\top$ is also positive semidefinite. By Thm. B.7, $rr^\top \odot XX^\top$ is also positive semid...
-
[14]
Comparing two eigenvalues. Finally, we compare the two smallest eigenvalues. Subtracting one from the other, we have $\lambda_n(\tilde K) - \lambda_n(K) \approx \frac{m + 2d}{4} \tilde s^2 + \left( \frac{1}{4} - \frac{1}{2\pi} \right) [d - (m + d) s^2]$. Because $n > d$, $s = 0$. Therefore, $\lambda_n(\tilde K) - \lambda_n(K) \approx \frac{m + 2d}{4} \tilde s^2 + \left( \frac{1}{4} - \frac{1}{2\pi} \right) d > 0$. C. Proof of Loss Crossing Results C.1. Proof of Prop. 4.1 Before proving Prop. 4.1, we first prove the following...
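The subtraction can be written out term by term from the two smallest-eigenvalue formulas, $\lambda_n(K) \approx \frac{m+d}{4}(s^2+1) - \frac{m}{2\pi}$ and $\lambda_n(\tilde K) \approx \frac{m+2d}{4}(\tilde s^2+1) + \frac{m+d}{2\pi}(s^2-1)$. The steps below are a reconstruction of the omitted algebra rather than a quote from the paper.

```latex
\begin{align*}
\lambda_n(\tilde K) - \lambda_n(K)
  &\approx \frac{m+2d}{4}(\tilde s^2 + 1) + \frac{m+d}{2\pi}(s^2 - 1)
     - \frac{m+d}{4}(s^2 + 1) + \frac{m}{2\pi} \\
  &= \frac{m+2d}{4}\tilde s^2
     + \underbrace{\frac{m+2d}{4} - \frac{m+d}{4}}_{=\, d/4}
     + \underbrace{\frac{m}{2\pi} - \frac{m+d}{2\pi}}_{=\, -d/(2\pi)}
     - \left( \frac{1}{4} - \frac{1}{2\pi} \right)(m+d)\, s^2 \\
  &= \frac{m+2d}{4}\tilde s^2
     + \left( \frac{1}{4} - \frac{1}{2\pi} \right)\left[ d - (m+d)\, s^2 \right].
\end{align*}
% For n > d we have s = 0, and since \pi > 2 implies 1/4 > 1/(2\pi),
% both remaining terms are non-negative and the second is strictly positive.
```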
-
[15]
That is, the ReLU model converges faster than the ReGLU model
At the early stage when $(\eta k)$ is small, as long as $Y^\top (K - \tilde K) Y \ge 0$, $d \ge 5$ and $n \ge 300$, it holds that $\mathbb{E}_\theta[L_k - \tilde L_k] < 0$. That is, the ReLU model converges faster than the ReGLU model
-
[16]
That is, the ReGLU model takes over and converges faster than the ReLU model
At the later stage when the minimum eigenvalue $\lambda_{\min}$ dominates the training process, for sufficiently large $k$, it holds that $\mathbb{E}_\theta[L_k - \tilde L_k] > 0$. That is, the ReGLU model takes over and converges faster than the ReLU model. Proof. 1) Early stage. We first consider the early stage when $(\eta k)$ is relatively small. Expanding the expression in Prop. 4.1, we obtain $\mathbb{E}_\theta[L_k] \propto$ Tr[(I−ηK) 2...
-
[17]
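To further analyze the dynamics at the later stage, we use eigendecomposition
Later stage. To further analyze the dynamics at the later stage, we use eigendecomposition. Denote by $(\lambda_i, v_i)$ the eigenpairs of $K$ and $\beta_i := Y^\top v_i$. For the ReLU model, we have that: $\mathbb{E}_\theta[L_k] \propto \mathrm{Tr}[(I - \eta K)^{2k} K] + Y^\top (I - \eta K)^{2k} Y \propto \sum_{i=1}^{n} (\lambda_i + \beta_i^2)(1 - \eta \lambda_i)^{2k}$. At the later stage when $k \to \infty$, $(1 - \eta \lambda_n)^{2k}$ dominates the above expression. We have, $\mathbb{E}_\theta[L_k] \propto (\lambda_n + \beta_n^2)(1 - \eta \lambda_n)^{2k} \sum_{i=1}^{n}$ λ...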
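The spectral sum $\sum_i (\lambda_i + \beta_i^2)(1 - \eta\lambda_i)^{2k}$ already shows how a loss crossing can arise. The toy sketch below uses two invented two-eigenvalue spectra, one wide and one compact, with a shared step size; the numbers are made up for illustration and are not the paper's NTK spectra.

```python
import numpy as np

def spectral_loss(lams, betas, eta, k):
    """Evaluate sum_i (lambda_i + beta_i^2) (1 - eta lambda_i)^(2k)."""
    lams = np.asarray(lams, dtype=float)
    betas = np.asarray(betas, dtype=float)
    return float(np.sum((lams + betas**2) * (1 - eta * lams) ** (2 * k)))

eta = 1.0
# TOY spectra (assumptions): "wide" has the larger top and the smaller bottom eigenvalue.
wide = dict(lams=[1.0, 0.01], betas=[1.0, 1.0])
compact = dict(lams=[0.5, 0.1], betas=[1.0, 1.0])

for k in (1, 5, 50):
    lw = spectral_loss(eta=eta, k=k, **wide)
    lc = spectral_loss(eta=eta, k=k, **compact)
    print(f"k={k:3d}  wide={lw:.5f}  compact={lc:.5f}")
```

In this toy setting the wide spectrum is ahead after one step, because its top mode is removed immediately, while the compact spectrum is far ahead by step 50, because its smallest eigenvalue contracts the slow mode much faster. That mirrors the early-stage and later-stage statements above.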