L2 Regularization versus Batch and Weight Normalization

18 Pith papers cite this work. Polarity classification is still indexing.

abstract

Batch Normalization is a commonly used trick to improve the training of deep neural networks. These neural networks use L2 regularization, also called weight decay, ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect when combined with normalization. Instead, regularization has an influence on the scale of weights, and thereby on the effective learning rate. We investigate this dependence both theoretically and experimentally. We show that popular optimization methods such as Adam only partially eliminate the influence of normalization on the learning rate. This leads to a discussion on other ways to mitigate this issue.
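
The scale argument in the abstract can be checked numerically. The sketch below is not taken from the paper: it assumes a single weight-normalized linear unit with a squared-error loss, and the names `normalized_loss` and `numerical_grad` are hypothetical helpers introduced only for illustration. It checks three properties of a scale-invariant loss: rescaling the weights leaves the loss unchanged, the gradient shrinks by the same factor the weights grow, and the gradient is orthogonal to the weights. Together these suggest why the weight norm behaves like a learning-rate knob (a step of size η changes the weight direction roughly in proportion to η / ||w||²) rather than a capacity control.

```python
import numpy as np

def normalized_loss(w, x, y):
    """Squared error of a weight-normalized linear unit.

    The prediction depends only on the direction w / ||w||, which is the
    assumption that makes the loss invariant to the scale of w.
    """
    pred = x @ (w / np.linalg.norm(w))
    return 0.5 * np.mean((pred - y) ** 2)

def numerical_grad(f, w, eps=1e-6):
    """Central-difference gradient of f at w (illustrative, not efficient)."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Synthetic data; sizes and the seed are arbitrary choices for the sketch.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))
y = rng.normal(size=32)
w = rng.normal(size=5)
alpha = 3.0  # arbitrary positive rescaling factor

loss = lambda v: normalized_loss(v, x, y)
grad_w = numerical_grad(loss, w)
grad_scaled = numerical_grad(loss, alpha * w)

# 1. Rescaling the weights does not change the loss.
scale_invariant = np.isclose(loss(w), loss(alpha * w))
# 2. The gradient at alpha * w is the gradient at w divided by alpha.
grad_shrinks = np.allclose(grad_scaled, grad_w / alpha, atol=1e-6)
# 3. The gradient has no component along w, so it cannot shrink the norm.
grad_orthogonal = abs(grad_w @ w) < 1e-6

print(scale_invariant, grad_shrinks, grad_orthogonal)
```

Under these assumptions, an L2 penalty is the only term that pulls the weight norm down, so its visible effect is to keep the norm small and the effective step on the weight direction large.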

hub tools

citation-role summary

background 3 · method 1

citation-polarity summary

representative citing papers

Muown Implicitly Performs Angular Step-size Decay

cs.LG · 2026-06-22 · conditional · novelty 6.0

Muown's update implicitly decays angular step size via magnitude modulation; AngularMuown decouples and schedules angular steps explicitly, yielding better empirical results.

Preserving Plasticity in Continual Learning via Dynamical Isometry

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

Dynamical isometry (Jacobian singular values near 1) preserves plasticity in continual learning; an isometry-promoting regularizer and decoupled AdamO optimizer match or beat prior methods on supervised and RL benchmarks.

Does Weight Decay Enhance Training Stability?

cs.LG · 2026-05-15 · conditional · novelty 6.0

Weight decay slows progressive sharpening at the edge of stability, inducing damped oscillations in CNNs and a phase transition to sub-2/η sharpness in MLPs driven by parameter-sharpness gradient alignment, yielding more stable NTK dynamics.

Demystifying Manifold Constraints in LLM Pre-training

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering competitive performance with convergence guarantees.

Anytime Training with Schedule-Free Spectral Optimization

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.

Adaptive Norm-Based Regularization for Neural Networks

stat.ML · 2026-04-30 · unverdicted · novelty 5.0

Covariance-aware ridge and combined l1-l2 regularizers for neural networks yield better predictive performance and complexity control than standard penalties in simulations and applications to cooling-load prediction and leukemia classification.

citing papers explorer

Showing 14 of 14 citing papers after filters.