pith. machine review for the scientific record.

arxiv: 1706.02677 · v2 · submitted 2017-06-08 · 💻 cs.CV · cs.DC · cs.LG

Recognition: 2 theorem links

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Aapo Kyrola, Andrew Tulloch, Kaiming He, Lukasz Wesolowski, Pieter Noordhuis, Piotr Dollár, Priya Goyal, Ross Girshick, Yangqing Jia

Pith reviewed 2026-05-12 07:15 UTC · model grok-4.3

classification 💻 cs.CV cs.DC cs.LG
keywords large minibatch SGD · ImageNet training · distributed training · ResNet-50 · linear scaling rule · warmup schedule · one-hour training · synchronous SGD

The pith

ResNet-50 matches small-minibatch ImageNet accuracy when trained with 8192-image minibatches on 256 GPUs in one hour.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large minibatch sizes in distributed synchronous SGD do not reduce final accuracy on ImageNet when early optimization difficulties are addressed. It applies a linear scaling rule that raises the learning rate in proportion to the minibatch size and adds a short warmup period at the beginning of training. These two changes allow a ResNet-50 network to train with a minibatch of 8192 images across 256 GPUs while matching the accuracy of conventional small-minibatch training. The complete run finishes in one hour and maintains roughly 90 percent scaling efficiency as the number of GPUs grows from 8 to 256. The result removes a practical barrier to training visual recognition models on very large datasets.

Core claim

With a hyper-parameter-free linear scaling rule for the learning rate and a warmup scheme that overcomes early optimization instability, large-minibatch SGD trains ResNet-50 on ImageNet using minibatches of 8192 images on 256 GPUs in one hour while matching the accuracy of small-minibatch training.

What carries the argument

The linear scaling rule that sets the learning rate proportional to minibatch size, combined with a warmup schedule to stabilize the first few epochs of large-minibatch training.
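
To make the mechanism concrete, the sketch below expresses the two rules in Python. The constants are illustrative assumptions consistent with the paper's reported recipe (a reference learning rate of 0.1 per 256-image minibatch, a warmup spanning roughly the first 5 epochs, step decays later in training); the function name and structure are ours, not the paper's code.

    # Sketch: linear scaling rule plus gradual warmup for large-minibatch SGD.
    # Constants are illustrative assumptions, not values extracted from this page.
    def learning_rate(epoch, it, iters_per_epoch, batch_size,
                      ref_batch=256, ref_lr=0.1,
                      warmup_epochs=5, decay_epochs=(30, 60, 80)):
        # Linear scaling rule: multiply the reference LR by the batch-size ratio.
        scaled_lr = ref_lr * batch_size / ref_batch
        if epoch < warmup_epochs:
            # Gradual warmup: ramp linearly from ref_lr up to scaled_lr.
            done = epoch * iters_per_epoch + it
            total = warmup_epochs * iters_per_epoch
            return ref_lr + (done / total) * (scaled_lr - ref_lr)
        # After warmup: a conventional step schedule on the scaled rate.
        lr = scaled_lr
        for boundary in decay_epochs:
            if epoch >= boundary:
                lr *= 0.1
        return lr

With batch_size set to 8192, the post-warmup rate under these assumed constants is 3.2, thirty-two times the small-batch reference; the warmup exists to ease training into exactly that jump.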

If this is right

  • ImageNet training time for ResNet-50 drops to one hour on 256 GPUs while preserving accuracy.
  • Synchronous distributed SGD scales to 256 GPUs with approximately 90 percent efficiency using only commodity hardware (a data-parallel step is sketched after this list).
  • Visual recognition models can be trained on internet-scale data with high efficiency and no accuracy penalty from large minibatches.
  • Simple, hyper-parameter-free adjustments suffice to keep generalization intact when minibatch size increases.
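
For the scaling claim in the second bullet, the following hedged sketch shows the shape of one synchronous data-parallel step. It uses PyTorch's torch.distributed allreduce purely as a stand-in for the paper's Caffe2 and Gloo implementation; the function and variable names are illustrative assumptions.

    # Sketch of one synchronous large-minibatch SGD step under data parallelism.
    # torch.distributed stands in here for the paper's Caffe2/Gloo allreduce.
    import torch
    import torch.distributed as dist

    def sync_sgd_step(model, loss_fn, images, labels, optimizer):
        # Each of the k workers computes a loss over its local n-image shard,
        # normalized by the shard size (the default for mean-reduced losses).
        loss = loss_fn(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        k = dist.get_world_size()
        # Sum gradients across workers and divide by k, so the update matches
        # SGD on a single minibatch of k * n images normalized by 1/(k * n).
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= k
        optimizer.step()

The roughly 90 percent efficiency figure depends on overlapping this gradient aggregation with backpropagation, which the sketch omits for clarity.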

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same linear scaling plus warmup approach may extend to other convolutional architectures and image-classification datasets.
  • Optimization difficulties with large batches appear limited to the initial phase rather than altering the quality of the final learned solution.
  • Even larger minibatches or greater numbers of GPUs could be tested by proportionally extending the warmup length.

Load-bearing premise

The only obstacles to large-minibatch training are early optimization instability and learning-rate magnitude, which can be fixed by linear scaling and warmup without harming final generalization on ImageNet.

What would settle it

Training the same ResNet-50 model with a minibatch size of 8192 but without the warmup schedule or without linear learning-rate scaling would produce lower top-1 accuracy on the ImageNet validation set than the small-minibatch baseline.
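
As a rough sense of scale for that test, dropping warmup while keeping linear scaling means the very first update already uses the fully scaled rate. A small arithmetic check, using the same illustrative constants as the schedule sketch above:

    # Illustrative check of what the ablation removes (values assumed, not
    # taken from this page): without warmup, training starts at the full
    # scaled rate instead of easing up from the small-batch reference.
    ref_lr, ref_batch, batch = 0.1, 256, 8192
    scaled_lr = ref_lr * batch / ref_batch      # 3.2, a 32x larger first step
    print(f"initial LR without warmup: {scaled_lr}, with warmup: {ref_lr}")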

read the original abstract

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that large minibatch sizes (up to 8192) cause optimization difficulties in SGD for ImageNet training but that these can be addressed via a hyper-parameter-free linear scaling rule for the learning rate combined with a new warmup schedule; with these changes, ResNet-50 reaches equivalent accuracy to small-batch baselines, and a Caffe2 implementation trains it in one hour on 256 GPUs while achieving ~90% scaling efficiency from 8 to 256 GPUs.

Significance. If the empirical result holds, the work provides a practical, simple method to scale synchronous SGD to large batches without accuracy loss on a standard benchmark, directly enabling much faster wall-clock training of visual recognition models and iteration on internet-scale data. Credit is due for the concrete, reproducible Caffe2 system and the scoped, falsifiable claim backed by direct training runs rather than post-hoc fitting.

minor comments (2)
  1. [Abstract] The statement that large-batch training 'matches small minibatch accuracy' would be strengthened by reporting the exact top-1 (and top-5) validation numbers for both the 8192-batch run and the small-batch reference under identical augmentation and evaluation protocols.
  2. [Experimental setup] Explicit confirmation is needed that the small-batch baseline used the same data augmentation, optimizer hyperparameters (apart from the scaled learning rate), and evaluation protocol, to rule out confounding factors in the accuracy comparison.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We appreciate the recognition of the empirical results, the reproducibility of the Caffe2 implementation, and the practical implications for scaling synchronous SGD.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central claim is an empirical observation: ResNet-50 on ImageNet reaches equivalent top-1 accuracy at minibatch size 8192 versus small batches when the learning rate is scaled linearly and a warmup schedule is applied. The scaling rule is applied as a fixed multiplier and the warmup as a short fixed schedule; neither is fitted or tuned to the reported accuracy numbers. The results derive from direct training runs on the target task rather than any derivation, prediction, or self-citation chain that reduces to the paper's inputs by construction. No load-bearing steps invoke uniqueness theorems, ansatzes smuggled via citation, or renaming of known results.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical validity of the linear scaling rule and the warmup schedule; both are presented as simple, largely hyper-parameter-free fixes whose parameters are not derived from first principles.

free parameters (2)
  • warmup length
    Duration of the initial gradual learning-rate increase is chosen to stabilize early training; exact value not stated in abstract.
  • base learning rate
    Reference learning rate before linear scaling is selected for the small-batch regime.
axioms (1)
  • domain assumption: the linear scaling rule maintains equivalent optimization dynamics when the batch size increases
    Invoked to justify multiplying the learning rate by the batch-size ratio without further tuning.

pith-pipeline@v0.9.0 · 5558 in / 1354 out tokens · 35125 ms · 2026-05-12T07:15:48.006355+00:00 · methodology

discussion (0)


Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

    math.OC 2026-05 unverdicted novelty 7.0

    Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

  2. TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

    cs.CR 2026-05 unverdicted novelty 7.0

    TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.

  3. A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow

    physics.flu-dyn 2026-04 unverdicted novelty 7.0

    A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on c...

  4. Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

    cs.CR 2026-04 unverdicted novelty 7.0

    Creates the BGTD benchmark and mmTraffic architecture to enable explainable multimodal interpretation of encrypted network traffic using LLMs.

  5. Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.

  6. Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

    cs.DC 2026-04 unverdicted novelty 7.0

    Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

  7. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  8. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  9. Segment Anything

    cs.CV 2023-04 unverdicted novelty 7.0

    A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

  10. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  11. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  12. A Simple Framework for Contrastive Learning of Visual Representations

    cs.LG 2020-02 accept novelty 7.0

    SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.

  13. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    cs.CL 2019-09 unverdicted novelty 7.0

    Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

  14. Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

    cs.LG 2026-05 unverdicted novelty 6.0

    Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...

  15. Hypernetworks for Dynamic Feature Selection

    cs.LG 2026-05 unverdicted novelty 6.0

    Hyper-DFS uses hypernetworks and Set Transformers to generate on-demand parameters for feature subsets in dynamic selection, outperforming prior methods on tabular data and showing stronger zero-shot generalization.

  16. OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

    cs.LG 2026-05 unverdicted novelty 6.0

    OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...

  17. Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising

    physics.geo-ph 2026-04 conditional novelty 6.0

    Adapting vision foundation models with LoRA and kurtosis-guided unsupervised test-time adaptation matches or exceeds domain-specific models for seismic denoising across multiple sites and unseen data.

  18. Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

    cs.CV 2026-04 unverdicted novelty 6.0

    DynamiCS dynamically scales semantic clusters per training epoch to reduce VLM pre-training compute while improving accuracy on long-tail concepts compared to static or flattening baselines.

  19. COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training

    cs.DC 2026-04 unverdicted novelty 6.0

    COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.

  20. CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training

    cs.LG 2026-04 unverdicted novelty 6.0

    CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.

  21. DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

    cs.CL 2026-04 unverdicted novelty 6.0

    DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.

  22. Back to Basics: Let Denoising Generative Models Denoise

    cs.CV 2025-11 unverdicted novelty 6.0

    Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.

  23. Mean Flows for One-step Generative Modeling

    cs.LG 2025-05 unverdicted novelty 6.0

    MeanFlow uses a derived identity between average and instantaneous velocities to train one-step flow models, achieving FID 3.43 on ImageNet 256x256 with 1-NFE from scratch.

  24. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  25. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  26. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  27. YOLOX: Exceeding YOLO Series in 2021

    cs.CV 2021-07 accept novelty 6.0

    YOLOX exceeds prior YOLO models by adopting anchor-free detection, decoupled heads, and SimOTA assignment to reach 50.0% AP on COCO for the large variant.

  28. Information theoretic underpinning of self-supervised learning by clustering

    cs.LG 2026-05 unverdicted novelty 5.0

    SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.

  29. Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes

    cs.LG 2026-05 unverdicted novelty 5.0

    Multi-narrow single-model ensembles outperform wide baselines in low-data image classification by learning diverse features but underperform in data-rich settings where training favors few paths.

  30. Probing Routing-Conditional Calibration in Attention-Residual Transformers

    cs.CV 2026-05 unverdicted novelty 5.0

    Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in AR transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.

  31. Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction

    math.OC 2026-05 unverdicted novelty 5.0

    Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.

  32. Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

    cs.DC 2026-05 unverdicted novelty 5.0

    FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.

  33. Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

    cs.LG 2026-05 unverdicted novelty 5.0

    A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.

  34. Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics

    cs.CV 2026-04 unverdicted novelty 5.0

    Distilled SAM 3 and DINOv3 models deliver near-teacher accuracy in pig tracking (92.29% MOTA, 96.15% IDF1) and behavior classification while achieving 7.77x parameter reduction and fitting on Jetson Orin NX with headroom.

  35. In-context modeling as a retrain-free paradigm for foundation models in computational science

    cs.CE 2026-04 unverdicted novelty 5.0

    In-Context Modeling lets one trained model generalize across unseen materials, geometries, and conditions in computational physics by treating measurements as context for inference.

  36. A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning

    cs.AI 2026-04 unverdicted novelty 5.0

    A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.

  37. Sampling Parallelism for Fast and Efficient Bayesian Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Sampling parallelism distributes Bayesian sample evaluations across GPUs for near-perfect scaling, lower memory use, and faster convergence via per-GPU data augmentations, outperforming pure data parallelism in diversity.

  38. MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

    cs.CV 2026-04 unverdicted novelty 4.0

    MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.

  39. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

  40. There Will Be a Scientific Theory of Deep Learning

    stat.ML 2026-04 unverdicted novelty 2.0

    A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 40 Pith papers · 1 internal anchor

  1. [1] J. Bagga, H. Morsy, and Z. Yao. Opening designs for 6-pack and Wedge 100. https://code.facebook.com/posts/203733993317833/opening-designs-for-6-pack-and-wedge-100, 2016.
  2. [2] M. Barnett, L. Shuler, R. van de Geijn, S. Gupta, D. G. Payne, and J. Watts. Interprocessor collective communication library (InterCom). In Scalable High-Performance Computing Conference, 1994.
  3. [3] L. Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. Unpublished open problem offered to the attendance of the SLDS 2009 conference, 2009.
  4. [4] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv:1606.04838, 2016.
  5. [5] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. arXiv:1604.00981, 2016.
  6. [6] K. Chen and Q. Huo. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In ICASSP, 2016.
  7. [7] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 2011.
  8. [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
  9. [9] R. Girshick. Fast R-CNN. In ICCV, 2015.
  10. [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  11. [11] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1999.
  12. [12] S. Gross and M. Wilber. Training and investigating Residual Nets. https://github.com/facebook/fb.resnet.torch, 2016.
  13. [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv:1703.06870, 2017.
  14. [15] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
  15. [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  16. [17] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.
  17. [18] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv:1510.08560, 2016.
  18. [19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  19. [20] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. ICLR, 2017.
  20. [21] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997, 2014.
  21. [22] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural nets. In NIPS, 2012.
  22. [23] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
  23. [24] K. Lee. Introducing Big Basin: Our next-generation AI hardware. https://code.facebook.com/posts/1835166200089399/introducing-big-basin, 2017.
  24. [25] M. Li. Scaling Distributed Machine Learning with System and Algorithm Co-design. PhD thesis, Carnegie Mellon University, 2017.
  25. [26] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  26. [27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  27. [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  28. [29] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.
  29. [30] R. Rabenseifner. Optimization of collective reduction operations. In ICCS. Springer, 2004.
  30. [31] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  31. [32] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.
  32. [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
  33. [34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  34. [35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  35. [36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  36. [37] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of collective communication operations in MPICH. IJHPCA, 2005.
  37. [38] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
  38. [39] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  39. [40] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. The Microsoft 2016 Conversational Speech Recognition System. arXiv:1609.03528, 2016.
  40. [41] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.