The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?
Pith reviewed 2026-05-21 06:43 UTC · model grok-4.3
The pith
GLU structures reshape the NTK spectrum to lower its condition number and speed up convergence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. This spectral property improves training dynamics, causing GLU models to converge faster than non-GLU models and producing a characteristic loss-crossing phenomenon. On models such as ViT and GPT-2, GLU shows limited effect on reducing the generalization gap, so its main value is accelerating optimization.
What carries the argument
The neural tangent kernel (NTK) spectrum of the model: the GLU structure modifies its condition number and eigenvalue distribution, which lets gradient descent make faster progress.
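For readers who want the quantities spelled out, the block below is a sketch of the standard definitions this argument relies on. The symbols (f, K, \lambda_i, u_i, \kappa, r) follow common NTK conventions and are assumptions here, not necessarily the paper's own notation; the smallest eigenvalue is assumed to be strictly positive.

```latex
% Empirical NTK Gram matrix on training inputs x_1, ..., x_n,
% with eigenpairs (\lambda_i, u_i) and condition number \kappa
K_{ij} = \big\langle \nabla_\theta f(x_i;\theta),\, \nabla_\theta f(x_j;\theta) \big\rangle,
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n > 0,
\qquad \kappa(K) = \frac{\lambda_1}{\lambda_n}.

% Gradient flow on the squared loss L = \tfrac{1}{2}\|r\|^2 with residual r = f - y
% and K held fixed (the NTK-regime assumption)
\frac{dr}{dt} = -K r
\quad\Longrightarrow\quad
L(t) = \frac{1}{2} \sum_{i=1}^{n} e^{-2\lambda_i t}\, \big(u_i^\top r(0)\big)^2 .
```

Each eigen-direction decays at its own rate \lambda_i. For discrete gradient descent with step size 1/\lambda_1, the slowest direction shrinks by a factor 1 - 1/\kappa per step, which is why a smaller condition number is associated with faster convergence.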
Load-bearing premise
The spectral effects observed in two-layer NTK analysis carry over to explain GLU advantages in deep nonlinear networks used in practice.
What would settle it
A direct computation showing that the NTK condition number remains unchanged or increases with GLU in a deep network would challenge the link between the two-layer analysis and real LLM performance.
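A minimal sketch of what such a direct computation could look like on a toy scale is given below. Everything in it is an assumption made for illustration: the helper names (`init_params`, `forward`, `empirical_ntk`, `condition_number`), the SiLU-gated block as the GLU variant, the plain SiLU block as the non-gated baseline, the absence of residual connections, the random inputs, and the finite-difference Jacobian. It is not the authors' code and makes no claim about which architecture ends up with the smaller condition number.

```python
import numpy as np


def silu(z):
    """SiLU / swish nonlinearity, z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))


def init_params(rng, d_in, width, depth, gated):
    """Return a flat parameter vector plus the weight shapes needed to unpack it."""
    shapes, d = [], d_in
    for _ in range(depth):
        shapes.append((d, width))        # value projection W
        if gated:
            shapes.append((d, width))    # gate projection V (gated blocks only)
        d = width
    shapes.append((d, 1))                # linear readout to a scalar output
    theta = np.concatenate(
        [rng.standard_normal(s).ravel() / np.sqrt(s[0]) for s in shapes]
    )
    return theta, shapes


def forward(theta, shapes, X, gated):
    """Scalar network output for every row of X."""
    mats, offset = [], 0
    for rows, cols in shapes:
        mats.append(theta[offset:offset + rows * cols].reshape(rows, cols))
        offset += rows * cols
    h, hidden = X, iter(mats[:-1])
    for W in hidden:
        if gated:
            V = next(hidden)
            h = (h @ W) * silu(h @ V)    # gated block: value times SiLU gate
        else:
            h = silu(h @ W)              # non-gated block
    return (h @ mats[-1]).ravel()


def empirical_ntk(theta, shapes, X, gated, eps=1e-5):
    """Gram matrix K = J J^T, with J estimated by central finite differences."""
    J = np.empty((X.shape[0], theta.size))
    for p in range(theta.size):
        step = np.zeros_like(theta)
        step[p] = eps
        J[:, p] = (forward(theta + step, shapes, X, gated)
                   - forward(theta - step, shapes, X, gated)) / (2.0 * eps)
    return J @ J.T


def condition_number(K):
    """Ratio of largest to smallest eigenvalue of a symmetric matrix."""
    eig = np.linalg.eigvalsh(K)          # eigenvalues in ascending order
    return eig[-1] / eig[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 4))      # 6 hypothetical inputs of dimension 4
    for gated in (False, True):
        theta, shapes = init_params(rng, d_in=4, width=8, depth=3, gated=gated)
        K = empirical_ntk(theta, shapes, X, gated)
        print("gated" if gated else "non-gated", condition_number(K))
```

A single random seed at this scale says little on its own; a meaningful check would average over seeds, match parameter counts between the two variants, and use an automatic-differentiation Jacobian on the architectures of interest.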
Original abstract
Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU has limited impact in reducing the generalization gap on various models, including ViT and GPT-2, suggesting that its primary benefit lies in accelerating optimization rather than reducing the generalization gap. The code is available at: https://github.com/Zemdalk/GLU-NTK.
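The link between a compact spectrum and faster convergence can be illustrated with a toy simulation of linearized gradient descent, written in the kernel's eigenbasis. This is a sketch under stated assumptions: the function name `gd_loss_curve`, the two spectra, the unit initial residual, and the step size 1/λ_max are all hypothetical choices for illustration, not values from the paper, and the toy does not attempt to reproduce the loss-crossing phenomenon.

```python
import numpy as np


def gd_loss_curve(eigvals, steps, r0=None):
    """Loss 0.5 * ||r_t||^2 under r_{t+1} = (I - lr * K) r_t, in K's eigenbasis.

    eigvals: eigenvalues of a fixed kernel K (assumed positive).
    r0: initial residual coordinates in the eigenbasis (defaults to all ones).
    The step size is set to 1 / max(eigvals), an assumption of this sketch.
    """
    eigvals = np.asarray(eigvals, dtype=float)
    lr = 1.0 / eigvals.max()
    r = np.ones_like(eigvals) if r0 is None else np.asarray(r0, dtype=float)
    losses = []
    for _ in range(steps + 1):
        losses.append(0.5 * np.sum(r ** 2))
        r = (1.0 - lr * eigvals) * r     # each mode contracts independently
    return np.array(losses)


if __name__ == "__main__":
    # Two hypothetical spectra with the same largest eigenvalue.
    compact = np.linspace(1.0, 10.0, 20)     # condition number 10
    spread = np.geomspace(0.01, 10.0, 20)    # condition number 1000
    for name, spectrum in (("compact", compact), ("spread", spread)):
        curve = gd_loss_curve(spectrum, steps=200)
        print(name, spectrum.max() / spectrum.min(), curve[-1])
```

Because both spectra share the same largest eigenvalue, they receive the same step size, and the difference in the final loss comes entirely from the small eigenvalues, whose modes shrink by only 1 - λ_i/λ_max per step.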
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes Gated Linear Units (GLU) in two-layer networks under the neural tangent kernel (NTK) regime, claiming that GLU reshapes the NTK spectrum to yield a smaller condition number and more compact eigenvalue distribution. This spectral change is then linked to faster convergence and a loss-crossing phenomenon in training dynamics. Empirical results on ViT and GPT-2 are presented to show that GLU has limited impact on reducing the generalization gap, suggesting its primary benefit is optimization speed rather than generalization.
Significance. If the NTK spectral reshaping and resulting condition-number reduction can be shown to persist or dominate in deeper, non-linear architectures with residuals, the work would offer a concrete theoretical account for the empirical superiority of GLU variants in modern LLMs, shifting focus from generalization to optimization dynamics. The reported generalization-gap experiments on ViT and GPT-2 provide useful negative evidence that strengthens the optimization-focused interpretation.
major comments (2)
- [Abstract and NTK analysis sections] The central derivation of spectrum reshaping, smaller condition number, and faster convergence is performed only for two-layer networks in the NTK regime (abstract and the analysis sections). The manuscript does not derive, bound, or empirically compute the NTK spectrum (or effective condition number) for any architecture deeper than two layers, leaving the extrapolation to deep LLMs (including the GPT-2 experiments) unverified and load-bearing for the stated explanation of GLU's advantage.
- [Training dynamics analysis] The training-dynamics claims, including the loss-crossing phenomenon, rest on the two-layer NTK spectrum result. Without additional analysis showing that the same spectral properties control convergence once depth, layer-wise non-linearities, and residual connections are introduced, the link between the derived condition-number improvement and observed behavior in modern models remains an assumption rather than a demonstrated mechanism.
minor comments (2)
- [Empirical section] Dataset details, error bars, and hyper-parameter choices for the ViT and GPT-2 generalization-gap experiments should be expanded to allow readers to assess whether post-hoc fitting affects the reported limited impact on generalization.
- [NTK derivation] Notation for the NTK eigenvalues and condition number should be made fully explicit (including any dependence on the gating parameters) to clarify that the reported improvement is not an artifact of parameter choices.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting the importance of verifying the NTK spectral properties in deeper architectures. We agree that our theoretical results are derived for two-layer networks and that extending them rigorously to deep models with residuals is a valuable direction for future work. Below, we address the major comments point by point.
- Referee: [Abstract and NTK analysis sections] The central derivation of spectrum reshaping, smaller condition number, and faster convergence is performed only for two-layer networks in the NTK regime (abstract and the analysis sections). The manuscript does not derive, bound, or empirically compute the NTK spectrum (or effective condition number) for any architecture deeper than two layers, leaving the extrapolation to deep LLMs (including the GPT-2 experiments) unverified and load-bearing for the stated explanation of GLU's advantage.
Authors: We acknowledge this limitation. Our analysis focuses on two-layer networks in the NTK regime to obtain closed-form insights into how GLU reshapes the spectrum and reduces the condition number. Extending this derivation to deeper networks is technically challenging due to the complexity of the NTK in the presence of residuals and multiple layers, and we do not claim to have done so. Instead, the two-layer case serves as a theoretical foundation, and we support the extrapolation with empirical evidence from training ViT and GPT-2 models, where GLU accelerates optimization without major generalization improvements. We will revise the manuscript to more explicitly state this scope and add a paragraph discussing the assumptions involved in applying the insights to modern LLMs.
Revision: partial
- Referee: [Training dynamics analysis] The training-dynamics claims, including the loss-crossing phenomenon, rest on the two-layer NTK spectrum result. Without additional analysis showing that the same spectral properties control convergence once depth, layer-wise non-linearities, and residual connections are introduced, the link between the derived condition-number improvement and observed behavior in modern models remains an assumption rather than a demonstrated mechanism.
Authors: The loss-crossing is an empirical observation in our experiments on deeper models. The NTK analysis in two layers explains a plausible mechanism via improved conditioning leading to faster convergence. While we agree that demonstrating the same spectral control in deep residual networks would provide stronger evidence, our current contribution is to identify this mechanism in a tractable setting and show consistency with practice. We will update the discussion to clarify that the link is based on the simplified model and empirical corroboration, rather than a complete proof for all architectures.
Revision: partial
- Deriving or bounding the NTK spectrum for deep residual networks with GLU and non-linear activations.
Circularity Check
The NTK spectrum analysis for two-layer networks is derived independently, without reducing to fitted inputs or self-citations.
The paper's core derivation analyzes the NTK for two-layer networks under the GLU activation to obtain the eigenvalue spectrum, condition number, and resulting convergence rates directly from the kernel expressions and activation properties. These steps are mathematical derivations rather than fits to the target advantage or self-referential definitions. No load-bearing self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear in the described chain; the generalization-gap experiments are presented as separate empirical observations. The analysis is self-contained within its stated two-layer NTK regime and does not reduce the claimed spectrum reshaping to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the NTK regime holds for the two-layer networks under consideration.