pith. machine review for the scientific record.

arxiv: 1811.00659 · v1 · submitted 2018-11-01 · 💻 cs.CL · cs.LG · cs.NE

Recognition: unknown

Implicit Regularization of Stochastic Gradient Descent in Natural Language Processing: Observations and Implications

Authors on Pith: no claims yet
classification: 💻 cs.CL · cs.LG · cs.NE
keywords: regularization, implicit, learning, deep, effect, generalization, neural, certain
read the original abstract

Deep neural networks with remarkably strong generalization performance are usually over-parameterized. Although practitioners use explicit regularization strategies to avoid over-fitting, their impact is often small. Some theoretical studies have analyzed the implicit regularization effect of stochastic gradient descent (SGD) on simple machine learning models under certain assumptions. However, how it behaves in practice on state-of-the-art models and real-world datasets is still unknown. To bridge this gap, we study the role of SGD's implicit regularization in deep learning systems. We show that pure SGD tends to converge to minima with better generalization performance on multiple natural language processing (NLP) tasks. This phenomenon coexists with dropout, an explicit regularizer. In addition, a neural network's finite learning capability does not alter the intrinsic nature of SGD's implicit regularization effect: under limited training samples or with certain corrupted labels, the effect remains strong. We further analyze stability by varying the weight initialization range. We corroborate these experimental findings with a decision-boundary visualization using a 3-layer neural network for interpretation. Altogether, our work deepens the understanding of how implicit regularization affects deep learning models and sheds light on future study of over-parameterized models' generalization ability.
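The "pure SGD" regime the abstract refers to means per-example (batch size 1) gradient updates with the data reshuffled each epoch; its update noise is what the implicit-regularization literature studies. A minimal plain-Python sketch of that training loop on a synthetic 2-D toy problem (an illustration only, not the paper's NLP experiments; the dataset, learning rate, and epoch count here are arbitrary choices):

```python
import math
import random

random.seed(0)

# Synthetic, linearly separable 2-D data: label 1 iff x + y > 0.
data = [((x, y), 1 if x + y > 0 else 0)
        for x, y in ((random.uniform(-1, 1), random.uniform(-1, 1))
                     for _ in range(200))]

# Logistic-regression parameters, started at zero.
w = [0.0, 0.0]
b = 0.0
lr = 0.1

def predict(p):
    """Sigmoid of the linear score for point p."""
    z = w[0] * p[0] + w[1] * p[1] + b
    return 1.0 / (1.0 + math.exp(-z))

for epoch in range(20):
    random.shuffle(data)           # fresh sample order each epoch
    for p, label in data:          # batch size 1: "pure" SGD
        err = predict(p) - label   # gradient of log-loss w.r.t. the logit
        w[0] -= lr * err * p[0]
        w[1] -= lr * err * p[1]
        b -= lr * err

accuracy = sum((predict(p) > 0.5) == (label == 1)
               for p, label in data) / len(data)
```

Replacing the inner loop with a single full-batch gradient step removes the per-example noise; the paper's claim is that keeping that noise (pure SGD) tends to select minima that generalize better.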

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Behavior Score Prediction in Resting-State Functional MRI by Deep State Space Modeling

    eess.SP · 2026-02 · unverdicted · novelty 6.0

    A deep state space model on rs-fMRI time series predicts Alzheimer's behavior scores better than functional connectivity approaches and identifies key predictive brain regions.