pith. machine review for the scientific record.

arxiv: 1706.02677 · v2 · submitted 2017-06-08 · 💻 cs.CV · cs.DC · cs.LG

Recognition: 2 theorem links

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Aapo Kyrola, Andrew Tulloch, Kaiming He, Lukasz Wesolowski, Pieter Noordhuis, Piotr Dollár, Priya Goyal, Ross Girshick, Yangqing Jia

Pith reviewed 2026-05-12 07:15 UTC · model grok-4.3

classification 💻 cs.CV cs.DC cs.LG
keywords large minibatch SGD · ImageNet training · distributed training · ResNet-50 · linear scaling rule · warmup schedule · one-hour training · synchronous SGD

The pith

ResNet-50 matches small-minibatch ImageNet accuracy when trained with 8192-image minibatches on 256 GPUs in one hour.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large minibatch sizes in distributed synchronous SGD do not reduce final accuracy on ImageNet when early optimization difficulties are addressed. It applies a linear scaling rule that raises the learning rate in proportion to the minibatch size and adds a short warmup period at the beginning of training. These two changes allow a ResNet-50 network to train with a minibatch of 8192 images across 256 GPUs while matching the accuracy of conventional small-minibatch training. The complete run finishes in one hour and maintains roughly 90 percent scaling efficiency as the number of GPUs grows from 8 to 256. The result removes a practical barrier to training visual recognition models on very large datasets.

Core claim

With a hyper-parameter-free linear scaling rule for the learning rate and a warmup scheme that overcomes early optimization instability, large-minibatch SGD trains ResNet-50 on ImageNet using minibatches of 8192 images on 256 GPUs in one hour while matching the accuracy of small-minibatch training.

What carries the argument

The linear scaling rule that sets the learning rate proportional to minibatch size, combined with a warmup schedule to stabilize the first few epochs of large-minibatch training.
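
To make the mechanism concrete, the sketch below expresses the two rules in Python. The constants are illustrative assumptions consistent with the paper's reported recipe (a reference learning rate of 0.1 per 256-image minibatch, a warmup spanning roughly the first 5 epochs, step decays later in training); the function name and structure are ours, not the paper's code.

    # Sketch: linear scaling rule plus gradual warmup for large-minibatch SGD.
    # Constants are illustrative assumptions, not values extracted from this page.
    def learning_rate(epoch, it, iters_per_epoch, batch_size,
                      ref_batch=256, ref_lr=0.1,
                      warmup_epochs=5, decay_epochs=(30, 60, 80)):
        # Linear scaling rule: multiply the reference LR by the batch-size ratio.
        scaled_lr = ref_lr * batch_size / ref_batch
        if epoch < warmup_epochs:
            # Gradual warmup: ramp linearly from ref_lr up to scaled_lr.
            done = epoch * iters_per_epoch + it
            total = warmup_epochs * iters_per_epoch
            return ref_lr + (done / total) * (scaled_lr - ref_lr)
        # After warmup: a conventional step schedule on the scaled rate.
        lr = scaled_lr
        for boundary in decay_epochs:
            if epoch >= boundary:
                lr *= 0.1
        return lr

With batch_size set to 8192, the post-warmup rate under these assumed constants is 3.2, thirty-two times the small-batch reference; the warmup exists to ease training into exactly that jump.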

If this is right

  • ImageNet training time for ResNet-50 drops to one hour on 256 GPUs while preserving accuracy.
  • Synchronous distributed SGD scales to 256 GPUs with approximately 90 percent efficiency using only commodity hardware (a data-parallel step is sketched after this list).
  • Visual recognition models can be trained on internet-scale data with high efficiency and no accuracy penalty from large minibatches.
  • Simple, hyper-parameter-free adjustments suffice to keep generalization intact when minibatch size increases.
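
For the scaling claim in the second bullet, the following hedged sketch shows the shape of one synchronous data-parallel step. It uses PyTorch's torch.distributed allreduce purely as a stand-in for the paper's Caffe2 and Gloo implementation; the function and variable names are illustrative assumptions.

    # Sketch of one synchronous large-minibatch SGD step under data parallelism.
    # torch.distributed stands in here for the paper's Caffe2/Gloo allreduce.
    import torch
    import torch.distributed as dist

    def sync_sgd_step(model, loss_fn, images, labels, optimizer):
        # Each of the k workers computes a loss over its local n-image shard,
        # normalized by the shard size (the default for mean-reduced losses).
        loss = loss_fn(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        k = dist.get_world_size()
        # Sum gradients across workers and divide by k, so the update matches
        # SGD on a single minibatch of k * n images normalized by 1/(k * n).
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= k
        optimizer.step()

The roughly 90 percent efficiency figure depends on overlapping this gradient aggregation with backpropagation, which the sketch omits for clarity.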

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same linear scaling plus warmup approach may extend to other convolutional architectures and image-classification datasets.
  • Optimization difficulties with large batches appear limited to the initial phase rather than altering the quality of the final learned solution.
  • Even larger minibatches or greater numbers of GPUs could be tested by proportionally extending the warmup length.

Load-bearing premise

The only obstacles to large-minibatch training are early optimization instability and learning-rate magnitude, which can be fixed by linear scaling and warmup without harming final generalization on ImageNet.

What would settle it

Training the same ResNet-50 model with a minibatch size of 8192 but without the warmup schedule or without linear learning-rate scaling would produce lower top-1 accuracy on the ImageNet validation set than the small-minibatch baseline.
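
As a rough sense of scale for that test, dropping warmup while keeping linear scaling means the very first update already uses the fully scaled rate. A small arithmetic check, using the same illustrative constants as the schedule sketch above:

    # Illustrative check of what the ablation removes (values assumed, not
    # taken from this page): without warmup, training starts at the full
    # scaled rate instead of easing up from the small-batch reference.
    ref_lr, ref_batch, batch = 0.1, 256, 8192
    scaled_lr = ref_lr * batch / ref_batch      # 3.2, a 32x larger first step
    print(f"initial LR without warmup: {scaled_lr}, with warmup: {ref_lr}")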

read the original abstract

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that large minibatch sizes (up to 8192) cause optimization difficulties in SGD for ImageNet training but that these can be addressed via a hyper-parameter-free linear scaling rule for the learning rate combined with a new warmup schedule; with these changes, ResNet-50 reaches equivalent accuracy to small-batch baselines, and a Caffe2 implementation trains it in one hour on 256 GPUs while achieving ~90% scaling efficiency from 8 to 256 GPUs.

Significance. If the empirical result holds, the work provides a practical, simple method to scale synchronous SGD to large batches without accuracy loss on a standard benchmark, directly enabling much faster wall-clock training of visual recognition models and iteration on internet-scale data. Credit is due for the concrete, reproducible Caffe2 system and the scoped, falsifiable claim backed by direct training runs rather than post-hoc fitting.

minor comments (2)
  1. [Abstract] The statement that large-batch training 'matches small minibatch accuracy' would be strengthened by reporting the exact top-1 (and top-5) validation numbers for both the 8192-batch run and the small-batch reference under identical augmentation and evaluation protocols.
  2. [Experimental setup] Explicit confirmation is needed that the small-batch baseline used the same data augmentation, optimizer hyperparameters (apart from the scaled learning rate), and evaluation protocol, to rule out confounding factors in the accuracy comparison.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We appreciate the recognition of the empirical results, the reproducibility of the Caffe2 implementation, and the practical implications for scaling synchronous SGD.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central claim is an empirical observation: ResNet-50 on ImageNet reaches equivalent top-1 accuracy at minibatch size 8192 versus small batches when the learning rate is scaled linearly and a warmup schedule is applied. The scaling rule is applied as a fixed multiplier and the warmup as a short fixed schedule; neither is fitted or tuned to the reported accuracy numbers. The results derive from direct training runs on the target task rather than any derivation, prediction, or self-citation chain that reduces to the paper's inputs by construction. No load-bearing steps invoke uniqueness theorems, ansatzes smuggled via citation, or renaming of known results.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical validity of the linear scaling rule and the warmup schedule; both are presented as simple, largely hyper-parameter-free fixes whose parameters are not derived from first principles.

free parameters (2)
  • warmup length
    Duration of the initial gradual learning-rate increase is chosen to stabilize early training; exact value not stated in abstract.
  • base learning rate
    Reference learning rate before linear scaling is selected for the small-batch regime.
axioms (1)
  • domain assumption: the linear scaling rule maintains equivalent optimization dynamics when the batch size increases
    Invoked to justify multiplying the learning rate by the batch-size ratio without further tuning.

pith-pipeline@v0.9.0 · 5558 in / 1354 out tokens · 35125 ms · 2026-05-12T07:15:48.006355+00:00 · methodology

discussion (0)


Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

    math.OC 2026-05 unverdicted novelty 7.0

    Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

  2. TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

    cs.CR 2026-05 unverdicted novelty 7.0

    TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.

  3. A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow

    physics.flu-dyn 2026-04 unverdicted novelty 7.0

    A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on c...

  4. Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

    cs.CR 2026-04 unverdicted novelty 7.0

    Creates the BGTD benchmark and mmTraffic architecture to enable explainable multimodal interpretation of encrypted network traffic using LLMs.

  5. Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.

  6. Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

    cs.DC 2026-04 unverdicted novelty 7.0

    Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

  7. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  8. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  9. Segment Anything

    cs.CV 2023-04 unverdicted novelty 7.0

    A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

  10. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  11. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  12. A Simple Framework for Contrastive Learning of Visual Representations

    cs.LG 2020-02 accept novelty 7.0

    SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.

  13. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    cs.CL 2019-09 unverdicted novelty 7.0

    Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

  14. Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

    cs.LG 2026-05 unverdicted novelty 6.0

    Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...

  15. Hypernetworks for Dynamic Feature Selection

    cs.LG 2026-05 unverdicted novelty 6.0

    Hyper-DFS uses hypernetworks and Set Transformers to generate on-demand parameters for feature subsets in dynamic selection, outperforming prior methods on tabular data and showing stronger zero-shot generalization.

  16. OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

    cs.LG 2026-05 unverdicted novelty 6.0

    OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...

  17. Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising

    physics.geo-ph 2026-04 conditional novelty 6.0

    Adapting vision foundation models with LoRA and kurtosis-guided unsupervised test-time adaptation matches or exceeds domain-specific models for seismic denoising across multiple sites and unseen data.

  18. Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

    cs.CV 2026-04 unverdicted novelty 6.0

    DynamiCS dynamically scales semantic clusters per training epoch to reduce VLM pre-training compute while improving accuracy on long-tail concepts compared to static or flattening baselines.

  19. COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training

    cs.DC 2026-04 unverdicted novelty 6.0

    COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.

  20. CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training

    cs.LG 2026-04 unverdicted novelty 6.0

    CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.

  21. DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

    cs.CL 2026-04 unverdicted novelty 6.0

    DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.

  22. Back to Basics: Let Denoising Generative Models Denoise

    cs.CV 2025-11 unverdicted novelty 6.0

    Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.

  23. Mean Flows for One-step Generative Modeling

    cs.LG 2025-05 unverdicted novelty 6.0

    MeanFlow uses a derived identity between average and instantaneous velocities to train one-step flow models, achieving FID 3.43 on ImageNet 256x256 with 1-NFE from scratch.

  24. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  25. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  26. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  27. YOLOX: Exceeding YOLO Series in 2021

    cs.CV 2021-07 accept novelty 6.0

    YOLOX exceeds prior YOLO models by adopting anchor-free detection, decoupled heads, and SimOTA assignment to reach 50.0% AP on COCO for the large variant.

  28. Information theoretic underpinning of self-supervised learning by clustering

    cs.LG 2026-05 unverdicted novelty 5.0

    SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.

  29. Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes

    cs.LG 2026-05 unverdicted novelty 5.0

    Multi-narrow single-model ensembles outperform wide baselines in low-data image classification by learning diverse features but underperform in data-rich settings where training favors few paths.

  30. Probing Routing-Conditional Calibration in Attention-Residual Transformers

    cs.CV 2026-05 unverdicted novelty 5.0

    Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in AR transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.

  31. Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction

    math.OC 2026-05 unverdicted novelty 5.0

    Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.

  32. Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

    cs.DC 2026-05 unverdicted novelty 5.0

    FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.

  33. Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

    cs.LG 2026-05 unverdicted novelty 5.0

    A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.

  34. Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics

    cs.CV 2026-04 unverdicted novelty 5.0

    Distilled SAM 3 and DINOv3 models deliver near-teacher accuracy in pig tracking (92.29% MOTA, 96.15% IDF1) and behavior classification while achieving 7.77x parameter reduction and fitting on Jetson Orin NX with headroom.

  35. In-context modeling as a retrain-free paradigm for foundation models in computational science

    cs.CE 2026-04 unverdicted novelty 5.0

    In-Context Modeling lets one trained model generalize across unseen materials, geometries, and conditions in computational physics by treating measurements as context for inference.

  36. A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning

    cs.AI 2026-04 unverdicted novelty 5.0

    A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.

  37. Sampling Parallelism for Fast and Efficient Bayesian Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Sampling parallelism distributes Bayesian sample evaluations across GPUs for near-perfect scaling, lower memory use, and faster convergence via per-GPU data augmentations, outperforming pure data parallelism in diversity.

  38. MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

    cs.CV 2026-04 unverdicted novelty 4.0

    MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.

  39. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  40. There Will Be a Scientific Theory of Deep Learning

    stat.ML 2026-04 unverdicted novelty 2.0

    A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 40 Pith papers · 1 internal anchor

  1. [1] J. Bagga, H. Morsy, and Z. Yao. Opening designs for 6-pack and Wedge 100. https://code.facebook.com/posts/203733993317833/opening-designs-for-6-pack-and-wedge-100, 2016.
  2. [2] M. Barnett, L. Shuler, R. van de Geijn, S. Gupta, D. G. Payne, and J. Watts. Interprocessor collective communication library (InterCom). In Scalable High-Performance Computing Conference, 1994.
  3. [3] L. Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. Unpublished open problem offered to the attendance of the SLDS 2009 conference, 2009.
  4. [4] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv:1606.04838, 2016.
  5. [5] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. arXiv:1604.00981, 2016.
  6. [6] K. Chen and Q. Huo. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In ICASSP, 2016.
  7. [7] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 2011.
  8. [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
  9. [9] R. Girshick. Fast R-CNN. In ICCV, 2015.
  10. [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  11. [11] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1999.
  12. [12] S. Gross and M. Wilber. Training and investigating Residual Nets. https://github.com/facebook/fb.resnet.torch, 2016.
  13. [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv:1703.06870, 2017.
  14. [15] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
  15. [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  16. [17] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.
  17. [18] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv:1510.08560, 2016.
  18. [19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  19. [20] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. ICLR, 2017.
  20. [21] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997, 2014.
  21. [22] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural nets. In NIPS, 2012.
  22. [23] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  23. [24] K. Lee. Introducing Big Basin: Our next-generation AI hardware. https://code.facebook.com/posts/1835166200089399/introducing-big-basin, 2017.
  24. [25] M. Li. Scaling Distributed Machine Learning with System and Algorithm Co-design. PhD thesis, Carnegie Mellon University, 2017.
  25. [26] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  26. [27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  27. [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  28. [29] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
  29. [30] R. Rabenseifner. Optimization of collective reduction operations. In ICCS. Springer, 2004.
  30. [31] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  31. [32] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.
  32. [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
  33. [34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  34. [35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  35. [36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  36. [37] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of collective communication operations in MPICH. IJHPCA, 2005.
  37. [38] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
  38. [39] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  39. [40] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. The Microsoft 2016 Conversational Speech Recognition System. arXiv:1609.03528, 2016.
  40. [41] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.