Gradient Descent Happens in a Tiny Subspace

Daniel A. Roberts; Ethan Dyer; Guy Gur-Ari

arxiv: 1812.04754 · v1 · pith:32UK2ZWInew · submitted 2018-12-12 · 💻 cs.LG · cs.AI · stat.ML

classification 💻 cs.LG cs.AI stat.ML

keywords subspace gradient descent learning mostly training argument classes

We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by a few top eigenvectors of the Hessian (equal to the number of classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.
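One way to make the abstract's claim concrete is to measure what fraction of the squared gradient norm lies in the span of the top-k Hessian eigenvectors. The sketch below is illustrative only and is not taken from the paper: the function names, the hand-built diagonal Hessian, and the use of power iteration are all assumptions, and it presumes a symmetric positive semi-definite Hessian with a clear gap after the k-th eigenvalue.

```python
import math

def matvec(H, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def top_eigenvectors(H, k, iters=500):
    """Approximate the k leading eigenvectors of a symmetric PSD matrix H
    by power iteration, projecting out eigenvectors already found."""
    n = len(H)
    found = []
    for idx in range(k):
        # Deterministic start vector so the sketch is reproducible.
        v = normalize([1.0 + 0.1 * (i + idx) for i in range(n)])
        for _ in range(iters):
            w = matvec(H, v)
            for u in found:  # Gram-Schmidt deflation
                c = dot(w, u)
                w = [a - c * b for a, b in zip(w, u)]
            v = normalize(w)
        found.append(v)
    return found

def gradient_fraction_in_top_subspace(g, H, k):
    """Return ||P g||^2 / ||g||^2, where P projects onto the span of the
    k leading eigenvectors of H (hypothetical helper for illustration)."""
    basis = top_eigenvectors(H, k)
    projected_sq = sum(dot(g, u) ** 2 for u in basis)
    return projected_sq / dot(g, g)

# Toy quadratic loss L(theta) = 0.5 * theta^T H theta with two large
# curvature directions, standing in for a 2-class problem (an assumption).
H = [
    [10.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.05],
]
theta = [1.0, 1.0, 1.0, 1.0]
g = matvec(H, theta)  # gradient of the quadratic loss at theta
frac = gradient_fraction_in_top_subspace(g, H, k=2)
print(frac)
```

In this toy case the gradient is dominated by the two high-curvature directions, so the printed fraction is close to 1. For a real network the Hessian is far too large to store densely; the eigenvectors would instead come from Hessian-vector products, which this sketch does not attempt.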

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scaling Laws for Neural Language Models
cs.LG 2020-01 unverdicted novelty 8.0

Empirical power-law scaling governs language model loss versus model size, data size, and compute, enabling optimal allocation of training compute.
AMUSE: Anytime Muon with Stable Gradient Evaluation
cs.LG 2026-05 unverdicted novelty 7.0

AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.
The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression
cs.LG 2026-04 unverdicted novelty 7.0

The spectral edge transitions from a gradient-driven functional direction before grokking to a perturbation-flat, ablation-critical compression axis at grokking, forming three universality classes predicted by a gap f...
Depth, Not Data: An Analysis of Hessian Spectral Bifurcation
cs.LG 2026-01 unverdicted novelty 7.0

Deep linear networks with balanced data covariance exhibit Hessian spectral bifurcation whose dominant-to-bulk eigenvalue ratio scales linearly with depth.
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
cs.LG 2024-03 conditional novelty 7.0

GaLore performs full-parameter LLM training with up to 65.5% less optimizer memory by projecting gradients onto a low-rank subspace at each step, matching full-rank performance on LLaMA pre-training and RoBERTa fine-tuning.
Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
cs.LG 2026-05 unverdicted novelty 6.0

Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization
cs.LG 2026-05 unverdicted novelty 6.0

Pro-KLShampoo projects KL-Shampoo preconditioners to a spike-and-flat parametric form on an r-dimensional subspace and recovers the full algebraic preconditioner via orthogonalization, outperforming KL-Shampoo on GPT-...
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
stat.ML 2026-05 unverdicted novelty 6.0

Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
Exploiting Correlations in Federated Learning: Opportunities and Practical Limitations
cs.IT 2026-04 unverdicted novelty 6.0

A correlation-based taxonomy unifies existing FL compression methods, experiments show correlation strengths vary by task and architecture, and adaptive mode-switching designs are proposed to exploit this.
Grokking as Dimensional Phase Transition in Neural Networks
cs.LG 2026-04 unverdicted novelty 6.0

Grokking occurs as the effective dimensionality of the gradient field transitions from sub-diffusive to super-diffusive at the onset of generalization, exhibiting self-organized criticality.
Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training
cs.LG 2026-03 unverdicted novelty 6.0

Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.
AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning
cs.LG 2026-01 unverdicted novelty 6.0

AGZO restricts ZO perturbations to an activation-derived low-rank subspace, claiming higher gradient cosine similarity and better benchmark performance than isotropic ZO baselines on Qwen3 and Pangu models.
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training
cs.LG 2026-05 unverdicted novelty 5.0

DBLP is a training-phase-aware bounded-loss transport protocol that reduces end-to-end distributed ML training time by 24.4% on average (up to 33.9%) and achieves up to 5.88x communication speedup during microbursts w...
On the Convergence Analysis of Muon
stat.ML 2025-05 unverdicted novelty 5.0

Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
cs.CV 2025-02 unverdicted novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization
cs.LG 2019-07 unverdicted novelty 4.0

Provides Hessian-based theoretical characterizations of SGD dynamics and a scale-invariant generalization bound for deep nets, backed by experiments on synthetic data, MNIST, and CIFAR-10.