Deep double descent: where bigger models and more data hurt

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever · 2021 · DOI 10.1088/1742-5468/ac3a74

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization

cs.AI · 2026-03-05 · conditional · novelty 7.0

Grokking delay follows T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), derived from norm separation in regularized optimization and validated with high correlations across 293 runs.

Understanding Goal Generalisation in Sequential Reinforcement Learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.

Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

cs.LG · 2026-05-09 · conditional · novelty 6.0

A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.

citing papers explorer

Showing 3 of 3 citing papers.

The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization
cs.AI · 2026-03-05 · conditional · none · ref 8
Understanding Goal Generalisation in Sequential Reinforcement Learning
cs.LG · 2026-05-22 · unverdicted · none · ref 45
Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World
cs.LG · 2026-05-09 · conditional · none · ref 37
