Implicit bias produces neural scaling laws in learning curves, from perceptrons to deep networks
Pith reviewed 2026-05-22 14:18 UTC · model grok-4.3
The pith
Gradient descent implicit bias produces two dynamical scaling laws that describe performance over the full training curve and recover the standard final test error scaling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gradient-based training induces an implicit bias that produces two novel dynamical scaling laws governing how performance evolves as a function of different norm-based complexity measures. Combined, these laws recover the well-known scaling for test error at convergence. The result holds across CNNs, ResNets, and Vision Transformers trained on MNIST, CIFAR-10, and CIFAR-100, and receives analytical support from a single-layer perceptron with logistic loss where the laws are derived directly from the implicit bias.
What carries the argument
The implicit bias induced by gradient-based training, which steers optimization toward solutions whose evolving norms produce power-law relationships between performance and complexity measures throughout training.
If this is right
- Performance follows predictable power-law relationships with respect to norm-based complexity measures at every stage of training, not only at convergence.
- The two dynamical laws together account for the observed scaling of test error with model or data size once training ends.
- The same mechanism and scaling behavior appear consistently from single-layer perceptrons to CNNs, ResNets, and Vision Transformers.
- Norm-based complexity measures serve as the natural variables for tracking how learning curves evolve under gradient training.
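One way to check a claimed power law between a norm-based measure and performance is a straight-line fit in log-log space. The sketch below is illustrative only: the helper name `fit_power_law`, the synthetic data, and the exponent -0.5 and prefactor 0.8 are assumptions chosen for the demo, not values from the paper.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y ~ a * x**b by least squares on (log x, log y); return (a, b)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = (x > 0) & (y > 0)  # logarithms need strictly positive values
    slope, intercept = np.polyfit(np.log(x[mask]), np.log(y[mask]), 1)
    return float(np.exp(intercept)), float(slope)

# Synthetic stand-in for (weight norm, test error) pairs along a training run.
# The exponent and prefactor here are arbitrary, not the paper's.
norms = np.logspace(0, 2, 50)
errors = 0.8 * norms ** -0.5

a, b = fit_power_law(norms, errors)
print(f"prefactor a = {a:.3f}, exponent b = {b:.3f}")
```

On real learning curves the fit window matters: a power law that only holds over part of the norm range will give an exponent that depends on which points are included.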
Where Pith is reading between the lines
- Changing the optimizer or loss function to reduce or remove the usual implicit bias should alter or eliminate the dynamical scaling laws.
- If the laws hold, intermediate performance could be predicted from early norm measurements without completing full training runs.
- The framework links scaling laws to the geometry of the optimization path rather than to final model capacity alone.
Load-bearing premise
The implicit bias induced by gradient-based training is the primary mechanism that produces the two dynamical scaling laws, both in the perceptron derivation and in the deep-network experiments.
What would settle it
Train the single-layer perceptron with an optimizer lacking the same implicit bias, such as one that explicitly constrains the norm differently, and check whether the two dynamical scaling laws with the predicted exponents still appear in the performance-versus-norm curves.
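A minimal sketch of such a test, assuming a teacher-generated separable dataset and using an explicit L2 penalty (weight decay) as the stand-in for "an optimizer that constrains the norm differently". The data model, hyperparameters, and function names (`make_separable`, `train`, `holdout_error`) are all hypothetical choices for illustration; the paper's actual setup may differ.

```python
import numpy as np

def make_separable(n, d, rng, margin=0.1):
    """Gaussian inputs labeled by a random unit 'teacher' vector.
    Points closer than `margin` to the decision boundary are dropped,
    so the retained data are linearly separable with a positive margin."""
    teacher = rng.standard_normal(d)
    teacher /= np.linalg.norm(teacher)
    X = rng.standard_normal((n, d))
    X = X[np.abs(X @ teacher) > margin]
    return X, np.sign(X @ teacher), teacher

def train(X, y, lr=0.5, steps=3000, weight_decay=0.0):
    """Full-batch gradient descent on the mean logistic loss.
    Returns the final weights and the weight norm after every step."""
    w = np.zeros(X.shape[1])
    norms = np.empty(steps)
    for t in range(steps):
        m = y * (X @ w)                       # per-sample margins
        s = 0.5 * (1.0 - np.tanh(0.5 * m))    # sigmoid(-m), overflow-safe
        grad = -(X * (y * s)[:, None]).mean(axis=0) + weight_decay * w
        w = w - lr * grad
        norms[t] = np.linalg.norm(w)
    return w, norms

def holdout_error(w, teacher, rng, n_test=20000):
    """Fraction of fresh Gaussian points where sign(w.x) disagrees with the teacher."""
    X = rng.standard_normal((n_test, w.size))
    return float(np.mean(np.sign(X @ w) != np.sign(X @ teacher)))

rng = np.random.default_rng(0)
X, y, teacher = make_separable(n=200, d=10, rng=rng)

w_free, norms_free = train(X, y)                      # plain gradient descent
w_decay, norms_decay = train(X, y, weight_decay=0.1)  # explicit norm penalty

err_free = holdout_error(w_free, teacher, rng)
err_decay = holdout_error(w_decay, teacher, rng)
print(f"final norm  plain GD: {norms_free[-1]:.2f}   with decay: {norms_decay[-1]:.2f}")
print(f"test error  plain GD: {err_free:.3f}   with decay: {err_decay:.3f}")
```

Recording the held-out error at every step, rather than only at the end, would give the performance-versus-norm curves whose shape the two runs are meant to contrast.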
Original abstract
Scaling laws in deep learning -- empirical power-law relationships linking model performance to resource growth -- have emerged as simple yet striking regularities across architectures, datasets, and tasks. These laws are particularly impactful in guiding the design of state-of-the-art models, since they quantify the benefits of increasing data or model size, and hint at the foundations of interpretability in machine learning. However, most studies focus on asymptotic behavior at the end of training. In this work, we describe a richer picture by analyzing the entire training dynamics: we identify two novel dynamical scaling laws that govern how performance evolves as a function of different norm-based complexity measures. Combined, our new laws recover the well-known scaling for test error at convergence. Our findings are consistent across CNNs, ResNets, and Vision Transformers trained on MNIST, CIFAR-10 and CIFAR-100. Furthermore, we provide analytical support using a single-layer perceptron trained with logistic loss, where we derive the new dynamical scaling laws, and we explain them through the implicit bias induced by gradient-based training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to identify two novel dynamical scaling laws that describe how performance evolves as a function of different norm-based complexity measures during training. These laws, when combined, recover the well-known scaling for test error at convergence. Analytical support is given via a derivation for a single-layer perceptron trained with logistic loss, while empirical consistency is shown for CNNs, ResNets, and Vision Transformers on MNIST, CIFAR-10, and CIFAR-100; the phenomena are attributed to the implicit bias induced by gradient-based training.
Significance. If the results hold, the work provides a mechanistic account of scaling laws rooted in training dynamics rather than solely asymptotic behavior, bridging a simple analytically tractable model to modern architectures. The explicit link to implicit bias and the recovery of the standard convergence scaling from the two dynamical laws would be a notable contribution to understanding why power-law relationships appear across models and datasets.
major comments (3)
- [§4] §4 (Perceptron analysis): The derivation of the dynamical scaling laws relies on the continuous-time gradient-flow ODE for logistic loss on separable data. This assumption is load-bearing for the central claim, yet the manuscript does not quantify how discrete SGD steps or non-separable regimes alter the predicted exponents or functional forms.
- [§5] §5 (Deep-network experiments): The observed scaling laws are reported under standard SGD, but no ablations compare optimizers with demonstrably different implicit biases (e.g., Adam versus SGD, or SGD with explicit weight decay). Without such controls, it remains unclear whether the reported power laws are produced specifically by implicit bias or could arise from finite-width effects or data-dependent feature learning.
- [§2–3] §2–3 (Definition of dynamical scaling laws): The exponents appearing in the proposed scaling forms are not shown to be parameter-free analytic predictions; if they are obtained by fitting to observed curves, the claim that implicit bias “produces” the laws rather than merely being consistent with them would require additional justification.
minor comments (2)
- [Figures 2–4] Figure captions for the norm-based complexity plots should explicitly state the precise norm (e.g., L2 of weights versus margin-normalized) used in each panel to avoid ambiguity when comparing to the theoretical predictions.
- [§5] The manuscript would benefit from a short discussion of how early-stopping or learning-rate schedules interact with the reported dynamical laws, even if only as a minor robustness check.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive major comments, which help clarify the scope and robustness of our claims on dynamical scaling laws induced by implicit bias. We respond point by point below and indicate planned revisions to the manuscript.
Point-by-point responses
- Referee: §4 (Perceptron analysis): The derivation of the dynamical scaling laws relies on the continuous-time gradient-flow ODE for logistic loss on separable data. This assumption is load-bearing for the central claim, yet the manuscript does not quantify how discrete SGD steps or non-separable regimes alter the predicted exponents or functional forms.
  Authors: We agree that the continuous-time gradient-flow analysis on separable data is central to the exact derivation. This limit is standard for characterizing implicit bias in linear classifiers and yields the parameter-free exponents in the dynamical laws. Discrete SGD approximates the flow for small step sizes, and non-separable regimes would require different analysis. In the revision we will add a dedicated limitations paragraph discussing these assumptions, including a brief numerical comparison of discrete versus continuous dynamics for the perceptron to illustrate that the leading exponents remain consistent.
  Revision: partial
- Referee: §5 (Deep-network experiments): The observed scaling laws are reported under standard SGD, but no ablations compare optimizers with demonstrably different implicit biases (e.g., Adam versus SGD, or SGD with explicit weight decay). Without such controls, it remains unclear whether the reported power laws are produced specifically by implicit bias or could arise from finite-width effects or data-dependent feature learning.
  Authors: We appreciate this suggestion for strengthening the causal link to implicit bias. Our experiments deliberately employ vanilla SGD without weight decay to align with the gradient-flow bias analyzed in the perceptron. In the revised version we will include new ablations using Adam and SGD with weight decay on the same CNN, ResNet, and ViT models and datasets. These controls are expected to produce deviations from the reported scaling laws, thereby supporting that the laws are tied to the specific implicit bias rather than finite-width or feature-learning effects alone.
  Revision: yes
- Referee: §2–3 (Definition of dynamical scaling laws): The exponents appearing in the proposed scaling forms are not shown to be parameter-free analytic predictions; if they are obtained by fitting to observed curves, the claim that implicit bias “produces” the laws rather than merely being consistent with them would require additional justification.
  Authors: The scaling forms and their exponents are first derived analytically from the implicit bias of gradient flow on the perceptron, producing parameter-free predictions. Sections 2–3 present the general functional forms motivated by this derivation, while the deep-network results serve as empirical validation that observed exponents match the predicted values. We will revise the text to state this distinction more explicitly and add a comparison table of analytically predicted versus empirically measured exponents across architectures.
  Revision: partial
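The discrete-versus-continuous comparison mentioned in the first response could look roughly like the sketch below. It treats full-batch gradient descent as an explicit-Euler discretization of the gradient flow and compares weights at a matched time `T = lr * steps` against a small-step run used as a proxy for the flow. The dataset, the time horizon, the step sizes, and the helper names are assumptions made for illustration, and the sketch uses full-batch steps rather than stochastic mini-batches.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the mean logistic loss log(1 + exp(-y * w.x))."""
    m = y * (X @ w)
    s = 0.5 * (1.0 - np.tanh(0.5 * m))   # sigmoid(-m), overflow-safe
    return -(X * (y * s)[:, None]).mean(axis=0)

def run_gd(X, y, lr, total_time):
    """Explicit-Euler discretization of dw/dt = -grad L(w) up to `total_time`."""
    w = np.zeros(X.shape[1])
    for _ in range(int(round(total_time / lr))):
        w = w - lr * logistic_grad(w, X, y)
    return w

# Hypothetical separable data: Gaussian inputs labeled by a unit teacher vector,
# with points near the decision boundary removed.
rng = np.random.default_rng(1)
d = 10
teacher = rng.standard_normal(d)
teacher /= np.linalg.norm(teacher)
X = rng.standard_normal((200, d))
X = X[np.abs(X @ teacher) > 0.1]
y = np.sign(X @ teacher)

T = 50.0
w_ref = run_gd(X, y, lr=0.005, total_time=T)   # small step: proxy for the flow

gaps = {}
for lr in (0.5, 0.1, 0.02):
    w = run_gd(X, y, lr=lr, total_time=T)
    gaps[lr] = float(np.linalg.norm(w - w_ref) / np.linalg.norm(w_ref))
    print(f"lr={lr:<5} relative distance to small-step reference: {gaps[lr]:.4f}")
```

Agreement of the weights at matched time is a weaker statement than agreement of the scaling exponents; checking the latter would require fitting the performance-versus-norm curves from each run.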
Circularity Check
Analytical derivation for perceptron is independent; deep-net results are empirical consistency checks
full rationale
The paper derives the two dynamical scaling laws analytically for the single-layer perceptron under logistic loss and continuous-time gradient flow, starting from the implicit bias toward max-margin solutions on separable data. These derivations use the ODE approximation of gradient descent and produce the scaling forms directly from the dynamics without fitting exponents or normalizations to the target curves. The recovery of the known test-error scaling at convergence follows by combining the two derived laws rather than by re-fitting. For deep networks the results are presented as empirical consistency across CNNs, ResNets and ViTs on MNIST/CIFAR, not as a derivation. No load-bearing step reduces to a self-citation, a fitted parameter renamed as prediction, or an ansatz smuggled from prior work by the same authors. The central claim therefore remains self-contained against external benchmarks.
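For reference, the gradient-flow setting described above can be written out as follows. The notation is an assumption (n training pairs (x_i, y_i) with labels y_i in {-1, +1}, weight vector w(t), loss averaged over samples); the paper's own normalization and symbols may differ.

```latex
% Assumed notation: n training pairs (x_i, y_i), y_i \in \{-1,+1\}, weights w(t).
L(w) = \frac{1}{n}\sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\, w^\top x_i}\right),
\qquad
\frac{dw}{dt} = -\nabla L(w)
             = \frac{1}{n}\sum_{i=1}^{n} \frac{y_i\, x_i}{1 + e^{\,y_i\, w^\top x_i}} .
```

Each sample's contribution to the flow is weighted by a factor that decays exponentially in its margin, which is why well-classified points stop driving the dynamics and the late-time direction is set by the points of smallest margin.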
Axiom & Free-Parameter Ledger
free parameters (1)
- dynamical scaling exponents
axioms (1)
- domain assumption: Gradient descent on logistic loss induces an implicit bias toward minimum-norm (max-margin) solutions on separable data
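The link between "minimum-norm" and "max-margin" in this assumption is the standard hard-margin formulation. A sketch of the statement as it is usually given following Soudry et al. (2018), for linearly separable data and a sufficiently small step size, with notation assumed here rather than taken from the paper:

```latex
% \hat{w}: minimum-norm separator with unit functional margin (hard-margin SVM).
\hat{w} = \arg\min_{w} \|w\|_2^2
\quad \text{s.t.} \quad y_i\, w^\top x_i \ge 1 \;\; \forall i,
\qquad
\lim_{t\to\infty} \frac{w(t)}{\|w(t)\|_2} = \frac{\hat{w}}{\|\hat{w}\|_2},
\qquad
\|w(t)\|_2 = \Theta(\log t).
```

The weights never converge: the norm grows without bound while the direction settles, which is why norm-based quantities are natural clocks for the training trajectory in this setting.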
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: dAlembert_cosh_solution_aczel (echoes)
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  logistic loss … Vλ(Δμ) = −1/λ (λΔμ − log 2 cosh(λΔμ))
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  implicit bias … maximum-margin classifier (Soudry et al., 2018)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.