Adding gradient noise improves learning for very deep networks.
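In code, the paper's technique amounts to adding zero-mean Gaussian noise to every gradient before the parameter update, with variance annealed over training as eta / (1 + t)^gamma; the constants below (gamma = 0.55, eta = 0.3) are values reported in the paper. A minimal PyTorch sketch:

```python
# Annealed Gaussian gradient noise: sigma_t^2 = eta / (1 + t)^gamma,
# injected into every gradient just before the optimizer step.
import torch

def add_gradient_noise(parameters, step, eta=0.3, gamma=0.55):
    """Add N(0, sigma_t^2) noise to each gradient."""
    sigma = (eta / (1 + step) ** gamma) ** 0.5
    for p in parameters:
        if p.grad is not None:
            p.grad.add_(torch.randn_like(p.grad) * sigma)

# Usage inside an ordinary training loop:
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for step in range(100):
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    add_gradient_noise(model.parameters(), step)  # inject noise, then update
    opt.step()
```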
7 Pith papers cite this work.
Representative citing papers
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.
- Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization. The regularized Transformer loss satisfies Villani's coercive energy criteria, yielding log-Sobolev constants C_LS ≤ λ^{-1} + d/λ² and finite-time convergence bounds for noisy SGD (worked statement after this list).
- Language models show good calibration when asked to estimate the probability that their own answers are correct, and calibration improves as models grow larger (a calibration-check sketch follows this list).
- ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model (one stability technique is sketched after this list).
- Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size (loss sketch after this list).
- Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy. SignSGD with pre-sign dithering and a calibrated hybrid switch to SGD reaches 92.18% accuracy on CIFAR-10 with ResNet-18, outperforming both pure SGD and pure SignSGD, and also beats Adam on CIFAR-100 (update-rule sketch after this list).
- Endogenous Regime Switching Driven by Scalar-Irreducible Learning Dynamics. Scalar-irreducible dynamics enable internally generated regime transitions in learning systems via feedback between fast dynamical variables and slow structural adaptation.
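For the Villani-landscape entry above, the step from a log-Sobolev constant to a finite-time convergence bound is the standard one for Langevin-type dynamics. The worked statement below uses the continuous-time diffusion and quotes the C_LS bound from the summary; treating noisy SGD as a discretization of this diffusion is an assumption, since the paper's exact discrete-time bound is not reproduced here.

```latex
% Convention: \pi satisfies a log-Sobolev inequality with constant C_{LS},
% i.e. \mathrm{Ent}_\pi(f^2) \le 2 C_{LS} \int |\nabla f|^2 \, d\pi.
% For the Langevin diffusion with stationary Gibbs measure \pi \propto e^{-L},
%   d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{2}\, dW_t,
% the law \rho_t of \theta_t contracts exponentially in KL divergence:
\[
  \mathrm{KL}(\rho_t \,\|\, \pi)
    \;\le\; e^{-2t/C_{LS}}\, \mathrm{KL}(\rho_0 \,\|\, \pi),
  \qquad
  C_{LS} \;\le\; \frac{1}{\lambda} + \frac{d}{\lambda^{2}},
\]
% so the weight-decay strength \lambda that controls C_{LS} also controls
% the mixing rate claimed in the summary.
```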
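For the calibration entry above, the claim can be checked by eliciting the model's stated probability that each answer is correct and comparing binned confidence against empirical accuracy, i.e. expected calibration error. A minimal sketch; the confidence/correctness arrays are hypothetical stand-ins, not data from the paper.

```python
# Expected calibration error: bin stated confidences, compare each bin's
# mean confidence to its empirical accuracy, and average the gaps by bin mass.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

conf = [0.9, 0.8, 0.95, 0.6, 0.7, 0.55]  # model's stated P(correct), hypothetical
hit  = [1,   1,   1,    0,   1,   0]     # whether each answer was right, hypothetical
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```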
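For the ST-MoE entry above, a central stability technique in that paper is the router z-loss, which penalizes the squared log-partition function of the router logits so the routing softmax stays numerically well behaved; the sketch below uses the coefficient of 1e-3 reported there.

```python
# Router z-loss from ST-MoE: square the log-partition of the router logits,
# discouraging large logits that destabilize mixed-precision softmaxes.
import torch

def router_z_loss(router_logits: torch.Tensor, coef: float = 1e-3) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) pre-softmax routing scores."""
    z = torch.logsumexp(router_logits, dim=-1)  # log-partition per token
    return coef * (z ** 2).mean()

logits = torch.randn(8, 4)    # 8 tokens routed over 4 experts
print(router_z_loss(logits))  # added to the task loss during training
```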
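For the ranked-preference entry above, the summary does not spell out the training objective; the sketch below shows the standard pairwise Bradley-Terry loss on a scalar reward model, as an assumption about the general technique rather than the paper's exact method.

```python
# Pairwise preference-modeling loss: push the reward of the preferred response
# above the rejected one via -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_chosen, r_rejected = torch.randn(16), torch.randn(16)  # reward scores per pair
print(preference_loss(r_chosen, r_rejected))
```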
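For the SignSGD entry above, a minimal sketch of the update rule: uniform dither added to the gradient before its sign is taken, followed by a later switch to plain SGD. The dither distribution and the fixed switch step are assumptions; the summary describes the switch as calibrated without specifying the rule.

```python
# Dithered SignSGD with a hybrid switch to SGD. Dither noise is added to the
# gradient *before* the sign is taken; after SWITCH_STEP the update falls
# back to plain SGD.
import torch

SWITCH_STEP = 500  # hypothetical fixed switch point; the paper calibrates this

def hybrid_step(params, step, lr_sign=1e-3, lr_sgd=1e-2, dither=1e-3):
    """Call in place of optimizer.step() after loss.backward()."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            if step < SWITCH_STEP:  # phase 1: dithered SignSGD
                noise = torch.empty_like(p.grad).uniform_(-dither, dither)
                p.add_(torch.sign(p.grad + noise), alpha=-lr_sign)
            else:                   # phase 2: plain SGD
                p.add_(p.grad, alpha=-lr_sgd)
```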