The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a strong inductive bias that can raise accuracy from failure to 99.8%.
Grokking as the transition from lazy to rich training dynamics.arXiv preprint arXiv:2310.06110
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
Empirical tests confirm robust feature repulsion signs but reveal activation-dependent spectral lock-in in grokking, with x^2 yielding rank-2 updates at epoch ~174 and ReLU remaining rank-1.
citing papers explorer
-
The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a strong inductive bias that can raise accuracy from failure to 99.8%.
-
Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking
Empirical tests confirm robust feature repulsion signs but reveal activation-dependent spectral lock-in in grokking, with x^2 yielding rank-2 updates at epoch ~174 and ReLU remaining rank-1.