{"work":{"id":"a3c30ead-1625-4c18-a9c1-e4928dcd0da6","openalex_id":null,"doi":null,"arxiv_id":"2201.02177","raw_key":null,"title":"Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets","authors":null,"authors_text":"Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra","year":2022,"venue":"cs.LG","abstract":"In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of \"grokking\" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset.","external_url":"https://arxiv.org/abs/2201.02177","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-14T21:52:56.128071+00:00","pith_arxiv_id":"2201.02177","created_at":"2026-05-09T06:50:41.163960+00:00","updated_at":"2026-05-14T21:52:56.128071+00:00","title_quality_ok":true,"display_title":"Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets","render_title":"Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets"},"hub":{"state":{"work_id":"a3c30ead-1625-4c18-a9c1-e4928dcd0da6","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":45,"external_cited_by_count":null,"distinct_field_count":8,"first_pith_cited_at":"2022-07-11T22:59:39+00:00","last_pith_cited_at":"2026-05-12T16:57:48+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-14T23:06:18.640338+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":1}],"polarity_counts":[{"context_polarity":"background","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T17:36:21.974202+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Progress measures for grokking via mechanistic interpretability","work_id":"edcfc489-e023-4b21-bb0a-4d1307915fd3","shared_citers":10},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":8},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":6},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":6},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":5},{"title":"In-context Learning and Induction Heads","work_id":"db2b0911-2758-4a2a-99dc-15b14b91bd5e","shared_citers":5},{"title":"Exact solutions to the nonlinear dynamics of learning in deep 
linear neural networks","work_id":"adbbf9c7-c3a4-4cb7-9a00-c98b12f8a315","shared_citers":4},{"title":"Grokfast: Accelerated grokking by amplifying slow gradients","work_id":"d749e06f-c567-426b-a50b-bfa87ef2f746","shared_citers":4},{"title":"On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima","work_id":"01efb355-7c12-4b89-bc42-91ee46ee276b","shared_citers":4},{"title":"//arxiv.org/abs/1703.00810","work_id":"3b14f412-2206-469d-bfd3-c387c75ea711","shared_citers":3},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":3},{"title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","work_id":"50e3b368-0243-4726-8186-233869802ad1","shared_citers":3},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":3},{"title":"Gradient descent happens in a tiny subspace.arXiv preprint arXiv:1812.04754","work_id":"4c83c7ac-e217-4d73-b433-14f6b522da36","shared_citers":3},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":3},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":3},{"title":"Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt","work_id":"04fe7340-1319-435d-a5e6-86ab8a7b0417","shared_citers":3},{"title":"Omnigrok: Grokking beyond algorithmic data.arXiv preprint arXiv:2210.01117","work_id":"3326c4c5-e049-47b2-b4f7-83ad97665931","shared_citers":3},{"title":"Scaling Language Models: Methods, Analysis & Insights from Training Gopher","work_id":"47ce8be9-e500-407d-af41-ac2d132215eb","shared_citers":3},{"title":"Sharpness-aware minimization for efficiently improving generalization.arXiv preprint arXiv:2010.01412","work_id":"0f502a27-fa5d-459b-a595-548dc0691a9c","shared_citers":3},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":3},{"title":"","work_id":"421fd22c-0ffb-4333-9732-837277428420","shared_citers":2},{"title":"22 Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, and Dong Yu","work_id":"07bf7000-64b4-4c9d-9f3d-cac739d79cf3","shared_citers":2},{"title":"Advances in neural information processing systems , volume=","work_id":"183b369f-7b01-45b9-9177-1f32a7bd9f57","shared_citers":2}],"time_series":[{"n":3,"year":2022},{"n":1,"year":2023},{"n":1,"year":2024},{"n":37,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T17:46:17.031006+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T17:36:18.513925+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets","claims":[{"claim_text":"In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. 
In some situations we show that neural networks learn through a process of \"grokking\" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and f","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T17:46:19.562560+00:00"}},"summary":{"title":"Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets","claims":[{"claim_text":"In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of \"grokking\" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and f","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Progress measures for grokking via mechanistic interpretability","work_id":"edcfc489-e023-4b21-bb0a-4d1307915fd3","shared_citers":10},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":8},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":6},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":6},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":5},{"title":"In-context Learning and Induction Heads","work_id":"db2b0911-2758-4a2a-99dc-15b14b91bd5e","shared_citers":5},{"title":"Exact solutions to the nonlinear dynamics of learning in deep linear neural networks","work_id":"adbbf9c7-c3a4-4cb7-9a00-c98b12f8a315","shared_citers":4},{"title":"Grokfast: Accelerated grokking by amplifying slow gradients","work_id":"d749e06f-c567-426b-a50b-bfa87ef2f746","shared_citers":4},{"title":"On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima","work_id":"01efb355-7c12-4b89-bc42-91ee46ee276b","shared_citers":4},{"title":"//arxiv.org/abs/1703.00810","work_id":"3b14f412-2206-469d-bfd3-c387c75ea711","shared_citers":3},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":3},{"title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","work_id":"50e3b368-0243-4726-8186-233869802ad1","shared_citers":3},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":3},{"title":"Gradient descent happens in a tiny subspace.arXiv preprint arXiv:1812.04754","work_id":"4c83c7ac-e217-4d73-b433-14f6b522da36","shared_citers":3},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat 
Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":3},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":3},{"title":"Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt","work_id":"04fe7340-1319-435d-a5e6-86ab8a7b0417","shared_citers":3},{"title":"Omnigrok: Grokking beyond algorithmic data.arXiv preprint arXiv:2210.01117","work_id":"3326c4c5-e049-47b2-b4f7-83ad97665931","shared_citers":3},{"title":"Scaling Language Models: Methods, Analysis & Insights from Training Gopher","work_id":"47ce8be9-e500-407d-af41-ac2d132215eb","shared_citers":3},{"title":"Sharpness-aware minimization for efficiently improving generalization.arXiv preprint arXiv:2010.01412","work_id":"0f502a27-fa5d-459b-a595-548dc0691a9c","shared_citers":3},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":3},{"title":"","work_id":"421fd22c-0ffb-4333-9732-837277428420","shared_citers":2},{"title":"22 Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, and Dong Yu","work_id":"07bf7000-64b4-4c9d-9f3d-cac739d79cf3","shared_citers":2},{"title":"Advances in neural information processing systems , volume=","work_id":"183b369f-7b01-45b9-9177-1f32a7bd9f57","shared_citers":2}],"time_series":[{"n":3,"year":2022},{"n":1,"year":2023},{"n":1,"year":2024},{"n":37,"year":2026}],"dependency_candidates":[]},"authors":[]}}