Adding gradient noise improves learning for very deep networks.
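In code, the paper's technique amounts to adding zero-mean Gaussian noise to every gradient before the parameter update, with variance annealed over training as eta / (1 + t)^gamma; the constants below (gamma = 0.55, eta = 0.3) are values reported in the paper. A minimal PyTorch sketch:

```python
# Annealed Gaussian gradient noise: sigma_t^2 = eta / (1 + t)^gamma,
# injected into every gradient just before the optimizer step.
import torch

def add_gradient_noise(parameters, step, eta=0.3, gamma=0.55):
    """Add N(0, sigma_t^2) noise to each gradient."""
    sigma = (eta / (1 + step) ** gamma) ** 0.5
    for p in parameters:
        if p.grad is not None:
            p.grad.add_(torch.randn_like(p.grad) * sigma)

# Usage inside an ordinary training loop:
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for step in range(100):
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    add_gradient_noise(model.parameters(), step)  # inject noise, then update
    opt.step()
```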
7 Pith papers cite this work.
Representative citing papers
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.
- Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization. The regularized Transformer loss satisfies Villani's coercive energy criteria, yielding log-Sobolev constants C_LS ≤ λ^{-1} + d/λ² and finite-time convergence bounds for noisy SGD (worked statement after this list).
- Language models show good calibration when asked to estimate the probability that their own answers are correct, and calibration improves as models grow larger (a calibration-check sketch follows this list).
- ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model (one stability technique is sketched after this list).
- Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size (loss sketch after this list).
- Enhancing SignSGD: Small-Batch Convergence Analysis and a Hybrid Switching Strategy. SignSGD with pre-sign dithering and a calibrated hybrid switch to SGD reaches 92.18% accuracy on CIFAR-10 with ResNet-18, outperforming both pure SGD and pure SignSGD, and also beats Adam on CIFAR-100 (update-rule sketch after this list).
- Endogenous Regime Switching Driven by Scalar-Irreducible Learning Dynamics. Scalar-irreducible dynamics enable internally generated regime transitions in learning systems via feedback between fast dynamical variables and slow structural adaptation.
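For the Villani-landscape entry above, the step from a log-Sobolev constant to a finite-time convergence bound is the standard one for Langevin-type dynamics. The worked statement below uses the continuous-time diffusion and quotes the C_LS bound from the summary; treating noisy SGD as a discretization of this diffusion is an assumption, since the paper's exact discrete-time bound is not reproduced here.

```latex
% Convention: \pi satisfies a log-Sobolev inequality with constant C_{LS},
% i.e. \mathrm{Ent}_\pi(f^2) \le 2 C_{LS} \int |\nabla f|^2 \, d\pi.
% For the Langevin diffusion with stationary Gibbs measure \pi \propto e^{-L},
%   d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{2}\, dW_t,
% the law \rho_t of \theta_t contracts exponentially in KL divergence:
\[
  \mathrm{KL}(\rho_t \,\|\, \pi)
    \;\le\; e^{-2t/C_{LS}}\, \mathrm{KL}(\rho_0 \,\|\, \pi),
  \qquad
  C_{LS} \;\le\; \frac{1}{\lambda} + \frac{d}{\lambda^{2}},
\]
% so the weight-decay strength \lambda that controls C_{LS} also controls
% the mixing rate claimed in the summary.
```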
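For the calibration entry above, the claim can be checked by eliciting the model's stated probability that each answer is correct and comparing binned confidence against empirical accuracy, i.e. expected calibration error. A minimal sketch; the confidence/correctness arrays are hypothetical stand-ins, not data from the paper.

```python
# Expected calibration error: bin stated confidences, compare each bin's
# mean confidence to its empirical accuracy, and average the gaps by bin mass.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

conf = [0.9, 0.8, 0.95, 0.6, 0.7, 0.55]  # model's stated P(correct), hypothetical
hit  = [1,   1,   1,    0,   1,   0]     # whether each answer was right, hypothetical
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```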
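For the ST-MoE entry above, a central stability technique in that paper is the router z-loss, which penalizes the squared log-partition function of the router logits so the routing softmax stays numerically well behaved; the sketch below uses the coefficient of 1e-3 reported there.

```python
# Router z-loss from ST-MoE: square the log-partition of the router logits,
# discouraging large logits that destabilize mixed-precision softmaxes.
import torch

def router_z_loss(router_logits: torch.Tensor, coef: float = 1e-3) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) pre-softmax routing scores."""
    z = torch.logsumexp(router_logits, dim=-1)  # log-partition per token
    return coef * (z ** 2).mean()

logits = torch.randn(8, 4)    # 8 tokens routed over 4 experts
print(router_z_loss(logits))  # added to the task loss during training
```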
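For the ranked-preference entry above, the summary does not spell out the training objective; the sketch below shows the standard pairwise Bradley-Terry loss on a scalar reward model, as an assumption about the general technique rather than the paper's exact method.

```python
# Pairwise preference-modeling loss: push the reward of the preferred response
# above the rejected one via -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_chosen, r_rejected = torch.randn(16), torch.randn(16)  # reward scores per pair
print(preference_loss(r_chosen, r_rejected))
```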
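For the SignSGD entry above, a minimal sketch of the update rule: uniform dither added to the gradient before its sign is taken, followed by a later switch to plain SGD. The dither distribution and the fixed switch step are assumptions; the summary describes the switch as calibrated without specifying the rule.

```python
# Dithered SignSGD with a hybrid switch to SGD. Dither noise is added to the
# gradient *before* the sign is taken; after SWITCH_STEP the update falls
# back to plain SGD.
import torch

SWITCH_STEP = 500  # hypothetical fixed switch point; the paper calibrates this

def hybrid_step(params, step, lr_sign=1e-3, lr_sgd=1e-2, dither=1e-3):
    """Call in place of optimizer.step() after loss.backward()."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            if step < SWITCH_STEP:  # phase 1: dithered SignSGD
                noise = torch.empty_like(p.grad).uniform_(-dither, dither)
                p.add_(torch.sign(p.grad + noise), alpha=-lr_sign)
            else:                   # phase 2: plain SGD
                p.add_(p.grad, alpha=-lr_sgd)
```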