Three Mechanisms of Weight Decay Regularization

Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse

classification cs.LG, stat.ML

keywords regularization, decay, weight, optimization, three, effect, effective, mechanisms

Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of $L_2$ regularization. Literal weight decay has been shown to outperform $L_2$ regularization for optimizers for which they differ. We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks.
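The abstract's distinction between literal weight decay and $L_2$ regularization can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names (`sgd_l2`, `sgd_decay`, `adam_step`) and the hyperparameter values are hypothetical, the Adam update is the standard textbook form, and the decoupled variant assumes the common convention of scaling the decay term by the learning rate. Under those assumptions, the two schemes coincide for plain SGD but generally differ for Adam, because the $L_2$ gradient term passes through Adam's per-coordinate rescaling while the literal decay term does not.

```python
import numpy as np


def sgd_l2(w, grad, lr, lam):
    """Plain SGD on loss + (lam / 2) * ||w||^2: the penalty enters the gradient."""
    return w - lr * (grad + lam * w)


def sgd_decay(w, grad, lr, lam):
    """Plain SGD with literal weight decay: shrink w directly after the step."""
    return w - lr * grad - lr * lam * w


def adam_step(w, grad, m, v, t, lr, lam, decoupled,
              b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step with either L2 regularization or decoupled weight decay.

    decoupled=False: lam * w is added to the gradient before Adam's rescaling.
    decoupled=True:  the gradient is left untouched and w is shrunk separately
                     (assumed convention: decay scaled by lr).
    """
    g = grad if decoupled else grad + lam * w
    m = b1 * m + (1.0 - b1) * g          # first-moment estimate
    v = b2 * v + (1.0 - b2) * g * g      # second-moment estimate
    m_hat = m / (1.0 - b1 ** t)          # bias correction
    v_hat = v / (1.0 - b2 ** t)
    w_new = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w_new = w_new - lr * lam * w
    return w_new, m, v


# Illustrative values only (not taken from the paper).
w0 = np.array([1.0, -2.0])
g0 = np.array([0.5, 0.1])
zeros = np.zeros_like(w0)

sgd_a = sgd_l2(w0, g0, lr=0.1, lam=0.1)
sgd_b = sgd_decay(w0, g0, lr=0.1, lam=0.1)
adam_l2, _, _ = adam_step(w0, g0, zeros, zeros, t=1, lr=0.1, lam=0.1,
                          decoupled=False)
adam_wd, _, _ = adam_step(w0, g0, zeros, zeros, t=1, lr=0.1, lam=0.1,
                          decoupled=True)

print("SGD  L2 vs decay identical:", np.allclose(sgd_a, sgd_b))
print("Adam L2 vs decay identical:", np.allclose(adam_l2, adam_wd))
```

In this toy setting the second coordinate shows the difference most clearly: under $L_2$ regularization the penalty term flips the sign of the rescaled gradient, whereas under decoupled decay the gradient step and the shrinkage act separately.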

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Vibrational infrared and Raman spectra of the methanol molecule with equivariant neural-network property surfaces
physics.chem-ph 2026-02 unverdicted novelty 7.0

Equivariant neural networks produce dipole and polarizability surfaces for methanol that enable variational computation of vibrational IR and Raman spectra agreeing with experiment to 2.2 cm^{-1} RMSD on fundamentals.

Demystifying Manifold Constraints in LLM Pre-training
cs.LG 2026-05 unverdicted novelty 6.0

Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...

Dante: An Open Source Model Pre-Training and Fine-Tuning Tool for the Dafne Federated Framework for Medical Image Segmentation
eess.IV 2026-05 unverdicted novelty 3.0

Dante is a new open-source backend for the Dafne ecosystem that implements configurable training from scratch, layer freezing, and channel-wise LoRA for medical image segmentation, with validation showing faster conve...