
arxiv: 2605.20749 · v2 · pith:DI2TYUIYnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?

Pith reviewed 2026-05-21 06:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords gated linear units, neural tangent kernel, condition number, training dynamics, convergence rate, generalization, large language models

The pith

GLU structures reshape the NTK spectrum to lower its condition number and speed up convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to explain the consistent advantage of gated linear units (GLU) over their non-gated counterparts in modern language models. It does so by examining two-layer networks in the neural tangent kernel (NTK) regime, where the gating mechanism alters the spectrum of the kernel. The result is a smaller condition number and a more compact eigenvalue distribution, which speed up training. The work also finds that this change does not substantially narrow the generalization gap, indicating that the benefit is mostly about reaching low loss faster rather than achieving better final performance.

Core claim

The GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. This spectral property improves training dynamics, causing GLU models to converge faster than non-GLU models and producing a characteristic loss-crossing phenomenon. On models such as ViT and GPT-2, GLU shows limited effect on reducing the generalization gap, so its main value is accelerating optimization.
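
The step from "smaller condition number" to "faster convergence" can be made concrete with the standard kernel-regime calculation for squared loss. The block below is a sketch in assumed notation (the symbols K, r_t, η, λ_i, u_i, κ are chosen here for illustration and are not necessarily the paper's): K is a fixed NTK Gram matrix on the training set, and r_t is the training residual after t gradient-descent steps.

```latex
% Assumed notation: K is the fixed n x n NTK Gram matrix with eigenpairs (\lambda_i, u_i),
% r_t = f_t - y is the training residual after t steps, \eta is the step size.
\begin{align}
r_{t+1} &= (I - \eta K)\, r_t, \\
L_t &= \tfrac12 \lVert r_t \rVert^2
     = \tfrac12 \sum_{i=1}^{n} (1 - \eta\lambda_i)^{2t}\, (u_i^\top r_0)^2, \\
L_t &\le \Bigl(1 - \tfrac{1}{\kappa}\Bigr)^{2t} L_0
\quad \text{for } \eta = 1/\lambda_{\max},\ \kappa = \lambda_{\max}/\lambda_{\min}.
\end{align}
```

Under these assumptions the slowest mode decays at a rate set by 1/κ, so a kernel with a smaller condition number admits a tighter worst-case bound on the training loss at every step.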

What carries the argument

The neural tangent kernel (NTK) spectrum of the model, whose condition number and eigenvalue distribution the GLU modifies to enable quicker gradient descent progress.
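
To make the object of study tangible, here is a minimal NumPy sketch that builds a finite-width empirical NTK Gram matrix for a two-layer ReLU network and a two-layer ReGLU network, then reports each condition number. The parametrization (1/sqrt(m) output scaling, fixed ±1 output weights that are still differentiated, Gaussian inputs scaled by 1/sqrt(d)) and all function names are assumptions made for illustration; the paper's exact setup may differ, and no particular numerical outcome is claimed.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0).astype(z.dtype)

def ntk_relu(X, W, a):
    """Empirical NTK of f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x)."""
    n, _ = X.shape
    m = W.shape[0]
    Z = X @ W.T                                          # pre-activations, shape (n, m)
    # Gradient w.r.t. w_r: a_r * relu'(z_r) * x
    G_w = (relu_grad(Z) * a)[:, :, None] * X[:, None, :]  # (n, m, d)
    # Gradient w.r.t. a_r: relu(z_r)
    G_a = relu(Z)                                        # (n, m)
    J = np.concatenate([G_w.reshape(n, -1), G_a], axis=1) / np.sqrt(m)
    return J @ J.T

def ntk_reglu(X, W, V, a):
    """Empirical NTK of f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x) * (v_r . x)."""
    n, _ = X.shape
    m = W.shape[0]
    Z = X @ W.T                                          # gate pre-activations
    U = X @ V.T                                          # linear branch
    G_w = (relu_grad(Z) * U * a)[:, :, None] * X[:, None, :]
    G_v = (relu(Z) * a)[:, :, None] * X[:, None, :]
    G_a = relu(Z) * U
    J = np.concatenate(
        [G_w.reshape(n, -1), G_v.reshape(n, -1), G_a], axis=1
    ) / np.sqrt(m)
    return J @ J.T

def condition_number(K):
    """lambda_max / lambda_min of a symmetric PSD matrix (inf if singular)."""
    lam = np.linalg.eigvalsh(K)                          # ascending eigenvalues
    return np.inf if lam[0] <= 0 else lam[-1] / lam[0]

# Hypothetical toy setting: n samples, input dimension d, width m.
rng = np.random.default_rng(0)
n, d, m = 20, 10, 512
X = rng.standard_normal((n, d)) / np.sqrt(d)
W = rng.standard_normal((m, d))
V = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

print("ReLU  condition number:", condition_number(ntk_relu(X, W, a)))
print("ReGLU condition number:", condition_number(ntk_reglu(X, W, V, a)))
```

Building the Jacobian explicitly keeps the sketch transparent at this scale; for larger widths one would accumulate the kernel per hidden unit instead of materializing the (n, m, d) gradient tensor.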

Load-bearing premise

The spectral effects observed in two-layer NTK analysis carry over to explain GLU advantages in deep nonlinear networks used in practice.

What would settle it

A direct computation showing that the NTK condition number remains unchanged or increases with GLU in a deep network would challenge the link between the two-layer analysis and real LLM performance.

Figures

Figures reproduced from arXiv: 2605.20749 by Peisong Wen, Qianqian Xu, Qingming Huang, Xingyu Lyu, Zhiyong Yang.

Figure 1
Figure 1: Illustration of our theoretical discovery. We find that (A) by adding the gating structure, (B) the NTK matrix of GLU structure becomes better conditioned, which (C) explains the faster optimization of GLU-based models.
Adjacent paper text (truncated in source): convergence of training process in the Neural Tangent Kernel (NTK) regime. While directly analyzing neural network optimization is inherently challenging, the NTK framework provides theor…
Figure 2
Figure 2: Comparison between theoretical and numerical NTK condition numbers for ReLU and ReGLU models. The theoretical predictions closely track the numerically computed condition numbers.
[Plot residue from an adjacent figure: x-axis "FFN Structure" (ReLU, GELU, SiLU); y-axis "Condition Number"; series Non-GLU and GLU; panels (a) and (b).]
Figure 3
Figure 3: Condition number of (a) ViT and (b) GPT-2 under different activation choices.
Adjacent paper text (truncated in source): Gram matrix XX⊤/d. As is shown in Fig.4, this gating mechanism significantly enhances the diagonal dominance of the kernel: diagonal entries become more pronounced, while off-diagonal entries are suppressed. In fact, by definition, the off-diagonal entries of NTK matrix denotes the model gradient correlation between different s…
Figure 4
Figure 4: Visualization of the NTK matrices for ReLU (left) and ReGLU (right) models.
Adjacent paper text (truncated in source): 4. Training Dynamics in the Kernel Regime: From NTK Spectrum to Loss Crossing. In previous section, we showed that GLU induces a more diagonally dominant NTK, leading to a smaller maximal eigenvalue and a larger minimal eigenvalue, and hence a more contracted spectrum. In this section, we show how this spectral reshaping leads to di…
Figure 5
Figure 5: Training trajectories in a two-sample toy model for ReLU and ReGLU. (a) 30 steps. (b) 70 steps. (c) 150 steps. (d) 1000 steps.
Adjacent paper text (truncated in source): 4.2. Loss-Crossing Phenomenon. We now formalize the above intuition and connect it to the loss-crossing phenomenon. Our analysis is in the infinite-width limit (m → ∞), where the model operates in the kernel regime, and the expected training loss admits a closed-form expression. Pro…
Figure 6
Figure 6: Training loss curves on two-layer MLP models. (a) Gaussian data with learning rate 0.005. (b) MNIST with learning rate 1 × 10^-5. (c) Gaussian data with learning rate 0.008. (d) MNIST with learning rate 5 × 10^-5.
Adjacent paper text (truncated in source): 5. Generalization Gap Analysis. Finally, we examine whether GLU variants help reduce the generalization gap. Intuitively, the multiplicative gating in GLU introduces second-order nonlinearity, whi…
Figure 8
Figure 8: Generalization gap vs training loss with MLP Mixer trained on Tiny ImageNet. (a) ReLU vs ReGLU. (b) GELU vs GEGLU. (c) SiLU vs SwiGLU.
[Plot residue from an adjacent figure, titled "ViT on Tiny ImageNet": x-axis L_train; y-axis labeled with L_test and L_train; panel (a) ReGLU vs ReLU, p-value 0.1770; panel (b) GEGLU vs GELU, p-value 0.2080; remainder truncated in source.]
Figure 9
Figure 9: Generalization gap vs training loss with ViT trained on Tiny ImageNet. (a) ReLU vs ReGLU. (b) GELU vs GEGLU. (c) SiLU vs SwiGLU.
Figure 10
Figure 10: Generalization gap vs training loss with MLP Mixer trained on CIFAR-10. (a) ReLU vs ReGLU. (b) GELU vs GEGLU. (c) SiLU vs SwiGLU.
[Plot residue from an adjacent figure, titled "ViT on CIFAR-10": x-axis L_train; y-axis labeled with L_test and L_train; panel (a) ReGLU vs ReLU, p-value 0.1310; panel (b) GEGLU vs GELU, p-value 0.1270; remainder truncated in source.]
Figure 11
Figure 11: Generalization gap vs training loss with ViT trained on CIFAR-10. (a) ReLU vs ReGLU. (b) GELU vs GEGLU. (c) SiLU vs SwiGLU.
[Plot residue from an adjacent figure, titled "GPT-2 on FineWeb-Edu": x-axis L_train; y-axis labeled with L_val and L_train; panel (a) GELU vs GEGLU, p-value 0.0660; panel (b) SiLU vs SwiGLU, p-value 0.4430.]
Figure 12
Figure 12: Generalization gap vs training loss with GPT-2 trained on FineWeb-Edu. (a) GELU vs GEGLU. (b) SiLU vs SwiGLU.

Original abstract

Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU has limited impact in reducing the generalization gap on various models, including ViT and GPT-2, suggesting that its primary benefit lies in accelerating optimization rather than reducing the generalization gap. The code is available at: https://github.com/Zemdalk/GLU-NTK.
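
The loss-crossing phenomenon mentioned in the abstract can be illustrated with a toy kernel-regime simulation. The sketch below assumes squared loss, a fixed kernel, and plain gradient descent, so that each eigen-direction of the kernel decays independently. The two spectra, the step size, and the function name `kernel_gd_losses` are hand-picked for illustration and are not taken from the paper: the "spread" spectrum has a larger top eigenvalue and a much smaller bottom eigenvalue than the "compact" one.

```python
import numpy as np

def kernel_gd_losses(eigvals, coeffs, lr, steps):
    """Loss 0.5 * ||r_t||^2 when the residual follows r_{t+1} = (I - lr * K) r_t.

    The residual is tracked in the eigenbasis of K, where mode i is simply
    multiplied by (1 - lr * eigvals[i]) at every step.
    """
    eigvals = np.asarray(eigvals, dtype=float)
    c = np.asarray(coeffs, dtype=float).copy()   # residual coordinates u_i^T r_0
    losses = [0.5 * np.sum(c ** 2)]
    for _ in range(steps):
        c = (1.0 - lr * eigvals) * c
        losses.append(0.5 * np.sum(c ** 2))
    return np.array(losses)

# Illustrative (assumed) spectra with the same initial residual in both modes.
spread_spectrum = [1.0, 0.01]    # condition number 100
compact_spectrum = [0.3, 0.2]    # condition number 1.5
lr, steps = 1.0, 50              # lr * lambda_max < 2 keeps both runs stable
r0 = [1.0, 1.0]

loss_spread = kernel_gd_losses(spread_spectrum, r0, lr, steps)
loss_compact = kernel_gd_losses(compact_spectrum, r0, lr, steps)

# First step at which the compact-spectrum run has strictly lower loss.
crossing_step = int(np.argmax(loss_compact < loss_spread))
print("crossing step:", crossing_step)
print("final losses (spread, compact):", loss_spread[-1], loss_compact[-1])
```

In this toy setting the spread spectrum wins the first step because its large eigenvalue removes one residual component immediately, but its small eigenvalue then stalls progress and the compact spectrum overtakes it. This is one way a crossing can arise; whether it matches the paper's formal statement depends on details not visible in this page.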

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes Gated Linear Units (GLU) in two-layer networks under the neural tangent kernel (NTK) regime, claiming that GLU reshapes the NTK spectrum to yield a smaller condition number and more compact eigenvalue distribution. This spectral change is then linked to faster convergence and a loss-crossing phenomenon in training dynamics. Empirical results on ViT and GPT-2 are presented to show that GLU has limited impact on reducing the generalization gap, suggesting its primary benefit is optimization speed rather than generalization.

Significance. If the NTK spectral reshaping and resulting condition-number reduction can be shown to persist or dominate in deeper, non-linear architectures with residuals, the work would offer a concrete theoretical account for the empirical superiority of GLU variants in modern LLMs, shifting focus from generalization to optimization dynamics. The reported generalization-gap experiments on ViT and GPT-2 provide useful negative evidence that strengthens the optimization-focused interpretation.

major comments (2)
  1. [Abstract and NTK analysis sections] The central derivation of spectrum reshaping, smaller condition number, and faster convergence is performed only for two-layer networks in the NTK regime (abstract and the analysis sections). The manuscript does not derive, bound, or empirically compute the NTK spectrum (or effective condition number) for any architecture deeper than two layers, leaving the extrapolation to deep LLMs (including the GPT-2 experiments) unverified and load-bearing for the stated explanation of GLU's advantage.
  2. [Training dynamics analysis] The training-dynamics claims, including the loss-crossing phenomenon, rest on the two-layer NTK spectrum result. Without additional analysis showing that the same spectral properties control convergence once depth, layer-wise non-linearities, and residual connections are introduced, the link between the derived condition-number improvement and observed behavior in modern models remains an assumption rather than a demonstrated mechanism.
minor comments (2)
  1. [Empirical section] Dataset details, error bars, and hyper-parameter choices for the ViT and GPT-2 generalization-gap experiments should be expanded to allow readers to assess whether post-hoc fitting affects the reported limited impact on generalization.
  2. [NTK derivation] Notation for the NTK eigenvalues and condition number should be made fully explicit (including any dependence on the gating parameters) to clarify that the reported improvement is not an artifact of parameter choices.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and for highlighting the importance of verifying the NTK spectral properties in deeper architectures. We agree that our theoretical results are derived for two-layer networks and that extending them rigorously to deep models with residuals is a valuable direction for future work. Below, we address the major comments point by point.

Point-by-point responses
  1. Referee: [Abstract and NTK analysis sections] The central derivation of spectrum reshaping, smaller condition number, and faster convergence is performed only for two-layer networks in the NTK regime (abstract and the analysis sections). The manuscript does not derive, bound, or empirically compute the NTK spectrum (or effective condition number) for any architecture deeper than two layers, leaving the extrapolation to deep LLMs (including the GPT-2 experiments) unverified and load-bearing for the stated explanation of GLU's advantage.

    Authors: We acknowledge this limitation. Our analysis focuses on two-layer networks in the NTK regime to obtain closed-form insights into how GLU reshapes the spectrum and reduces the condition number. Extending this derivation to deeper networks is technically challenging due to the complexity of the NTK in the presence of residuals and multiple layers, and we do not claim to have done so. Instead, the two-layer case serves as a theoretical foundation, and we support the extrapolation with empirical evidence from training ViT and GPT-2 models, where GLU accelerates optimization without major generalization improvements. We will revise the manuscript to more explicitly state this scope and add a paragraph discussing the assumptions involved in applying the insights to modern LLMs. revision: partial

  2. Referee: [Training dynamics analysis] The training-dynamics claims, including the loss-crossing phenomenon, rest on the two-layer NTK spectrum result. Without additional analysis showing that the same spectral properties control convergence once depth, layer-wise non-linearities, and residual connections are introduced, the link between the derived condition-number improvement and observed behavior in modern models remains an assumption rather than a demonstrated mechanism.

    Authors: The loss-crossing is an empirical observation in our experiments on deeper models. The NTK analysis in two layers explains a plausible mechanism via improved conditioning leading to faster convergence. While we agree that demonstrating the same spectral control in deep residual networks would provide stronger evidence, our current contribution is to identify this mechanism in a tractable setting and show consistency with practice. We will update the discussion to clarify that the link is based on the simplified model and empirical corroboration, rather than a complete proof for all architectures. revision: partial

standing simulated objections not resolved
  • Deriving or bounding the NTK spectrum for deep residual networks with GLU and non-linear activations.

Circularity Check

0 steps flagged

NTK spectrum analysis for two-layer networks is derived independently without reducing to fitted inputs or self-citations

full rationale

The paper's core derivation analyzes the NTK for two-layer networks under the GLU activation to obtain the eigenvalue spectrum, condition number, and resulting convergence rates directly from the kernel expressions and activation properties. These steps are mathematical derivations rather than fits to the target advantage or self-referential definitions. No load-bearing self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear in the described chain; the generalization-gap experiments are presented as separate empirical observations. The analysis is self-contained within its stated two-layer NTK regime and does not reduce the claimed spectrum reshaping to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the NTK approximation for two-layer networks and the assumption that spectrum properties directly govern convergence speed in the regimes studied. No free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption NTK regime holds for the two-layer networks under consideration
    Invoked to justify the spectrum analysis that produces the condition-number claim.
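
For readers who want the assumption spelled out, the block below states what "NTK regime" usually means, in assumed notation (f, θ_0, Θ, K, κ are illustrative symbols, not necessarily the paper's): the network is approximated by its first-order expansion around initialization, so the kernel built from initial gradients stays fixed during training.

```latex
% Assumed notation: f(x;\theta) is the network output, \theta_0 the initialization,
% x_1,\dots,x_n the training inputs.
\begin{align}
f(x;\theta) &\approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0), \\
\Theta(x, x') &= \bigl\langle \nabla_\theta f(x;\theta_0),\ \nabla_\theta f(x';\theta_0) \bigr\rangle, \\
K_{ij} &= \Theta(x_i, x_j), \qquad
\kappa(K) = \frac{\lambda_{\max}(K)}{\lambda_{\min}(K)}.
\end{align}
```

The condition-number claim concerns κ(K) for this Gram matrix; if the linearization fails, for example at small width or large learning rate, K changes during training and the spectral argument no longer applies directly.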

pith-pipeline@v0.9.0 · 5703 in / 1343 out tokens · 29798 ms · 2026-05-21T06:43:31.331333+00:00 · methodology
