Inductive bias and spectral properties of single-head attention in high dimensions

Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Florent Krzakala, Lenka Zdeborová · 2025 · stat.ML · arXiv 2509.24914

4 Pith papers cite this work. Polarity classification is still being indexed.

abstract

Trained attention layers exhibit striking and reproducible spectral structure of the weights, including low-rank collapse, bulk deformation, and isolated spectral outliers, yet the origin of these phenomena and their implications for generalization remain poorly understood. We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks generated from the attention-indexed model. Using tools from random matrix theory, spin-glass theory, and approximate message passing, we obtain an exact high-dimensional characterization of training and test error, interpolation and recovery thresholds, and the spectrum of the key and query matrices. Our theory predicts the full singular-value distribution of the trained query-key map, including low-rank structure and isolated spectral outliers, in qualitative agreement with observations in more realistic transformers. Finally, for targets with power-law spectra, we show that learning proceeds through sequential spectral recovery, leading to the emergence of power-law scaling laws.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

Neural LoFi models deep learning as layer-wise spectral filtering that selects maximal low-degree correlations, yielding a tractable surrogate for hierarchical representation learning beyond the lazy regime.

High-Dimensional Theory of LoRA Fine-Tuning in a Solvable Attention Model

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

In a solvable attention model, pre-training followed by rank-one LoRA admits sharp asymptotic predictions for test errors and representation alignment via an effective noise term.

How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models

stat.ML · 2026-05-07 · conditional · novelty 7.0

Attention pooling produces a free-multiplicative-convolution bulk spectrum and two phase transitions for signal recovery; optimal weights are the top eigenvector of the positional correlation matrix R.

How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks

cs.LG · 2026-06-26 · unverdicted · novelty 5.0

Quadratic two-layer networks exhibit data-dependent power-law generalization scaling with distinct regimes in width and sample size, including an interpolation transition whose location depends on target spectrum.

citing papers explorer

Showing 4 of 4 citing papers.

Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning

cs.LG · 2026-05-13 · unverdicted · none · ref 78 · internal anchor

High-Dimensional Theory of LoRA Fine-Tuning in a Solvable Attention Model

cs.LG · 2026-06-04 · unverdicted · none · ref 17 · internal anchor

How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models

stat.ML · 2026-05-07 · conditional · none · ref 1 · internal anchor

How Width and Data Shape Generalization Scaling Laws in Quadratic Neural Networks

cs.LG · 2026-06-26 · unverdicted · none · ref 5 · internal anchor
