Recognition: unknown
Gradient Descent Happens in a Tiny Subspace
Original abstract
We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (equal to the number of classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.
This paper has not been read by Pith yet.
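The measurement the abstract describes can be reproduced at toy scale. The sketch below is not the authors' code: the tiny linear classifier, synthetic data, and dense-Hessian construction in PyTorch are all illustrative assumptions. It estimates the fraction of the gradient norm lying in the span of the top-k Hessian eigenvectors, with k set to the number of classes as the abstract suggests.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): how much of the
# gradient lies in the span of the top-k Hessian eigenvectors for a tiny classifier.
import torch

torch.manual_seed(0)
n_class = 3
X = torch.randn(64, 5)
y = torch.randint(0, n_class, (64,))
model = torch.nn.Linear(5, n_class)  # ~18 parameters, so a dense Hessian is cheap

loss = torch.nn.functional.cross_entropy(model(X), y)
params = list(model.parameters())

# First-order gradient, kept differentiable so we can take second derivatives.
grads = torch.autograd.grad(loss, params, create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])

# Build the dense Hessian row by row with a second backward pass per entry.
rows = []
for i in range(flat_grad.numel()):
    row = torch.autograd.grad(flat_grad[i], params, retain_graph=True)
    rows.append(torch.cat([r.reshape(-1) for r in row]))
H = torch.stack(rows)

# Top-k eigenvectors of the Hessian (eigh returns eigenvalues in ascending order).
evals, evecs = torch.linalg.eigh(H)
top = evecs[:, -n_class:]

g = flat_grad.detach()
proj = top @ (top.T @ g)  # projection of the gradient onto the top-k subspace
frac = (proj.norm() / g.norm()).item()
print(f"fraction of gradient norm in top-{n_class} Hessian subspace: {frac:.3f}")
```

In the regimes the paper studies, this fraction approaches 1 after a short period of training; on a toy problem like the one above the value is only indicative, and for large models the dense Hessian would have to be replaced by Hessian-vector products with an iterative eigensolver.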
Forward citations
Cited by 12 Pith papers
- The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression
  The spectral edge transitions from a gradient-driven functional direction before grokking to a perturbation-flat, ablation-critical compression axis at grokking, forming three universality classes predicted by a gap f...
- Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
  Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
- Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization
  Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-...
- Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
  Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...
- TLoRA: Task-aware Low Rank Adaptation of Large Language Models
  TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
- Exploiting Correlations in Federated Learning: Opportunities and Practical Limitations
  A correlation-based taxonomy unifies existing FL compression methods, experiments show correlation strengths vary by task and architecture, and adaptive mode-switching designs are proposed to exploit this.
- Grokking as Dimensional Phase Transition in Neural Networks
  Grokking occurs as the effective dimensionality of the gradient field transitions from sub-diffusive to super-diffusive at the onset of generalization, exhibiting self-organized criticality.
- Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training
  Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training
  DBLP is a training-phase-aware bounded-loss transport protocol that reduces end-to-end distributed ML training time by 24.4% on average (up to 33.9%) and achieves up to 5.88x communication speedup during microbursts w...
- Scaling Laws for Neural Language Models