Inductive bias and spectral properties of single-head attention in high dimensions
4 Pith papers cite this work. Polarity classification is still indexing.
abstract
Trained attention layers exhibit striking and reproducible spectral structure of the weights, including low-rank collapse, bulk deformation, and isolated spectral outliers, yet the origin of these phenomena and their implications for generalization remain poorly understood. We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks generated from the attention-indexed model. Using tools from random matrix theory, spin-glass theory, and approximate message passing, we obtain an exact high-dimensional characterization of training and test error, interpolation and recovery thresholds, and the spectrum of the key and query matrices. Our theory predicts the full singular-value distribution of the trained query-key map, including low-rank structure and isolated spectral outliers, in qualitative agreement with observations in more realistic transformers. Finally, for targets with power-law spectra, we show that learning proceeds through sequential spectral recovery, leading to the emergence of power-law scaling laws.
citation-role summary
citation-polarity summary
years
2026 (4)
roles
background (1)
polarities
background (1)
representative citing papers
In a solvable attention model, pre-training followed by rank-one LoRA admits sharp asymptotic predictions for test errors and representation alignment via an effective noise term.
Attention pooling produces a free-multiplicative-convolution bulk spectrum and two phase transitions for signal recovery; optimal weights are the top eigenvector of the positional correlation matrix R.
Quadratic two-layer networks exhibit data-dependent power-law generalization scaling with distinct regimes in width and sample size, including an interpolation transition whose location depends on target spectrum.
citing papers explorer
- Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning
  Neural LoFi models deep learning as layer-wise spectral filtering that selects maximal low-degree correlations, yielding a tractable surrogate for hierarchical representation learning beyond the lazy regime.
- High-Dimensional Theory of LoRA Fine-Tuning in a Solvable Attention Model
  In a solvable attention model, pre-training followed by rank-one LoRA admits sharp asymptotic predictions for test errors and representation alignment via an effective noise term.
- How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models
  Attention pooling produces a free-multiplicative-convolution bulk spectrum and two phase transitions for signal recovery; optimal weights are the top eigenvector of the positional correlation matrix R.
- How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks
  Quadratic two-layer networks exhibit data-dependent power-law generalization scaling with distinct regimes in width and sample size, including an interpolation transition whose location depends on target spectrum.