Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
40 Pith papers cite this work. Polarity classification is still indexing.
abstract
Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.
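Below is a minimal sketch (plain Python, not the authors' Caffe2 code) of the two techniques the abstract names: the linear learning-rate scaling rule and gradual warmup. It assumes the paper's reference settings of base learning rate 0.1 at minibatch size 256 and a 5-epoch warmup; the function and variable names are illustrative only, and the post-warmup stepwise decay schedule is omitted.

```python
def learning_rate(epoch: float,
                  minibatch_size: int,
                  base_lr: float = 0.1,
                  reference_batch: int = 256,
                  warmup_epochs: float = 5.0) -> float:
    """Learning rate at a (possibly fractional) epoch.

    Linear scaling rule: if the minibatch is k times the reference size,
    the target learning rate is k * base_lr.
    Gradual warmup: ramp linearly from base_lr to the target over the
    first `warmup_epochs` epochs, then hold the target value.
    """
    k = minibatch_size / reference_batch
    target_lr = k * base_lr
    if epoch < warmup_epochs:
        # Linear interpolation from the small-minibatch LR up to the scaled LR.
        return base_lr + (target_lr - base_lr) * (epoch / warmup_epochs)
    return target_lr


# Example: a minibatch of 8192 images spread over 256 GPUs is 32 images per
# worker; 8192 / 256 = 32, so the post-warmup learning rate is 32 * 0.1 = 3.2.
for e in [0.0, 2.5, 5.0, 30.0]:
    print(e, round(learning_rate(e, minibatch_size=8192), 3))
```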
citing papers explorer
- Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits
Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
- TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
- A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow
A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on cylinders and aerodynamic efficiency gains from 53% to 73% on airfoils.
- Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark
Creates the BGTD benchmark and mmTraffic architecture to enable explainable multimodal interpretation of encrypted network traffic using LLMs.
- Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation
FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.
- Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
A simulation study shows that cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger collectives see diminishing returns.
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
- Segment Anything
A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
- Scalable Diffusion Models with Transformers
DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.
- OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- A Simple Framework for Contrastive Learning of Visual Representations
SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
- Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.
- Hypernetworks for Dynamic Feature Selection
Hyper-DFS uses hypernetworks and Set Transformers to generate on-demand parameters for feature subsets in dynamic selection, outperforming prior methods on tabular data and showing stronger zero-shot generalization.
- OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.
- Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising
Adapting vision foundation models with LoRA and kurtosis-guided unsupervised test-time adaptation matches or exceeds domain-specific models for seismic denoising across multiple sites and unseen data.
- Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training
DynamiCS dynamically scales semantic clusters per training epoch to reduce VLM pre-training compute while improving accuracy on long-tail concepts compared to static or flattening baselines.
- COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
COPUS co-adapts batch size and parallelism during LLM training via goodput, converging on average 3.9% to 8% faster than fixing one while tuning the other.
- CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training
CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.
- DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
- Back to Basics: Let Denoising Generative Models Denoise
Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.
- Mean Flows for One-step Generative Modeling
MeanFlow uses a derived identity between average and instantaneous velocities to train one-step flow models, achieving FID 3.43 on ImageNet 256x256 with 1-NFE from scratch.
- Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
- Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- YOLOX: Exceeding YOLO Series in 2021
YOLOX exceeds prior YOLO models by adopting anchor-free detection, decoupled heads, and SimOTA assignment to reach 50.0% AP on COCO for the large variant.
- Information theoretic underpinning of self-supervised learning by clustering
SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
- Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes
Multi-narrow single-model ensembles outperform wide baselines in low-data image classification by learning diverse features but underperform in data-rich settings where training favors few paths.
- Probing Routing-Conditional Calibration in Attention-Residual Transformers
Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in AR transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.
- Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction
Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
- Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
- Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring
A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.
- Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics
Distilled SAM 3 and DINOv3 models deliver near-teacher accuracy in pig tracking (92.29% MOTA, 96.15% IDF1) and behavior classification while achieving 7.77x parameter reduction and fitting on Jetson Orin NX with headroom.
- In-context modeling as a retrain-free paradigm for foundation models in computational science
In-Context Modeling lets one trained model generalize across unseen materials, geometries, and conditions in computational physics by treating measurements as context for inference.
- A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning
A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.
- Sampling Parallelism for Fast and Efficient Bayesian Learning
Sampling parallelism distributes Bayesian sample evaluations across GPUs for near-perfect scaling, lower memory use, and faster convergence via per-GPU data augmentations, outperforming pure data parallelism in diversity.
- MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
- There Will Be a Scientific Theory of Deep Learning
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.