{"work":{"id":"f3dc32a4-cf81-467b-8ff4-3b2f21d3bf1f","openalex_id":null,"doi":null,"arxiv_id":"1706.02677","raw_key":null,"title":"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour","authors":null,"authors_text":"Priya Goyal, Piotr Doll\\'ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola","year":2017,"venue":"cs.CV","abstract":"Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.","external_url":"https://arxiv.org/abs/1706.02677","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-14T19:32:52.175466+00:00","pith_arxiv_id":"1706.02677","created_at":"2026-05-09T03:55:07.762877+00:00","updated_at":"2026-05-14T19:32:52.175466+00:00","title_quality_ok":false,"display_title":"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour","render_title":"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour"},"hub":{"state":{"work_id":"f3dc32a4-cf81-467b-8ff4-3b2f21d3bf1f","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":40,"external_cited_by_count":null,"distinct_field_count":12,"first_pith_cited_at":"2019-09-17T19:42:54+00:00","last_pith_cited_at":"2026-05-13T12:27:22+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-14T23:06:18.763961+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":1}],"polarity_counts":[{"context_polarity":"background","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T17:49:43.046648+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":12},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":11},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":9},{"title":"Megatron-LM: Training 
Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":8},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":7},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":7},{"title":"An Empirical Model of Large-Batch Training","work_id":"f3989b96-1ee9-403f-a0e2-0342ac16bcb7","shared_citers":6},{"title":"Deep Learning Scaling is Predictable, Empirically","work_id":"3638ccb4-3a4f-460e-8b6f-867a65922801","shared_citers":6},{"title":"Generating Long Sequences with Sparse Transformers","work_id":"c5b81688-45ee-4a9a-b095-e6290f45cb6c","shared_citers":6},{"title":"Large batch training of convolutional networks","work_id":"92799584-c1e5-4828-bbfe-0771b7fe8706","shared_citers":6},{"title":"One weird trick for parallelizing convolutional neural networks","work_id":"ad32a219-5e6c-4d6b-999b-487749b8c9c3","shared_citers":5},{"title":"Auto-Encoding Variational Bayes","work_id":"97d95295-30e1-42b4-bbf6-85f0fa4edb44","shared_citers":5},{"title":"Don’t decay the learning rate, increase the batch size","work_id":"04ddda12-c77f-444d-88b2-5f4786276d69","shared_citers":5},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":5},{"title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","work_id":"50e3b368-0243-4726-8186-233869802ad1","shared_citers":5},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":5},{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":5},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":5},{"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","shared_citers":5},{"title":"Scaling Laws for Autoregressive Generative Modeling","work_id":"1f180c21-02d6-4b11-9dfc-08d7f0d8fc81","shared_citers":5},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":5},{"title":"ALBERT: A Lite BERT for Self-supervised Learning of Language Representations","work_id":"aedf7950-7c35-4e28-a32d-bec290f51669","shared_citers":4},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":4},{"title":"Le, and Zhifeng Chen","work_id":"37375911-ccd2-4498-abd8-45af00a7f04f","shared_citers":4}],"time_series":[{"n":1,"year":2019},{"n":1,"year":2020},{"n":2,"year":2021},{"n":3,"year":2022},{"n":2,"year":2023},{"n":2,"year":2024},{"n":3,"year":2025},{"n":25,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T17:49:25.708331+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T17:49:46.942876+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Accurate, 
Large Minibatch SGD: Training ImageNet in 1 Hour","claims":[{"claim_text":"Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization.","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T17:49:57.378641+00:00"}},"summary":{"title":"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour","claims":[{"claim_text":"Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization.","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":12},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":11},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":9},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":8},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":7},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":7},{"title":"An Empirical Model of Large-Batch Training","work_id":"f3989b96-1ee9-403f-a0e2-0342ac16bcb7","shared_citers":6},{"title":"Deep Learning Scaling is Predictable, Empirically","work_id":"3638ccb4-3a4f-460e-8b6f-867a65922801","shared_citers":6},{"title":"Generating Long Sequences with Sparse Transformers","work_id":"c5b81688-45ee-4a9a-b095-e6290f45cb6c","shared_citers":6},{"title":"Large batch training of convolutional networks","work_id":"92799584-c1e5-4828-bbfe-0771b7fe8706","shared_citers":6},{"title":"One weird trick for parallelizing convolutional neural networks","work_id":"ad32a219-5e6c-4d6b-999b-487749b8c9c3","shared_citers":5},{"title":"Auto-Encoding Variational Bayes","work_id":"97d95295-30e1-42b4-bbf6-85f0fa4edb44","shared_citers":5},{"title":"Don’t decay the learning rate, increase the batch 
size","work_id":"04ddda12-c77f-444d-88b2-5f4786276d69","shared_citers":5},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":5},{"title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","work_id":"50e3b368-0243-4726-8186-233869802ad1","shared_citers":5},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":5},{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":5},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":5},{"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","shared_citers":5},{"title":"Scaling Laws for Autoregressive Generative Modeling","work_id":"1f180c21-02d6-4b11-9dfc-08d7f0d8fc81","shared_citers":5},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":5},{"title":"ALBERT: A Lite BERT for Self-supervised Learning of Language Representations","work_id":"aedf7950-7c35-4e28-a32d-bec290f51669","shared_citers":4},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":4},{"title":"Le, and Zhifeng Chen","work_id":"37375911-ccd2-4498-abd8-45af00a7f04f","shared_citers":4}],"time_series":[{"n":1,"year":2019},{"n":1,"year":2020},{"n":2,"year":2021},{"n":3,"year":2022},{"n":2,"year":2023},{"n":2,"year":2024},{"n":3,"year":2025},{"n":25,"year":2026}],"dependency_candidates":[]},"authors":[]}}