Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

2017 · cs.CV · arXiv 1706.02677

Canonical reference. 82% of citing Pith papers cite this work as background.

99 Pith papers citing it

Background 82% of classified citations

abstract

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.
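
The two techniques named in the abstract, the linear scaling rule and a warmup phase, can be sketched in a few lines. This is a minimal illustration rather than the authors' Caffe2 implementation. The function names are hypothetical, and the reference learning rate of 0.1 at a minibatch of 256 and the linear shape of the warmup ramp are assumptions that the abstract itself does not state.

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: scale the learning rate by the same factor as the minibatch."""
    return base_lr * batch / base_batch


def warmup_lr(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Ramp linearly from start_lr to target_lr over warmup_steps, then hold target_lr."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


def per_worker_batch(global_batch: int, num_workers: int) -> int:
    """Synchronous SGD splits one global minibatch evenly across workers."""
    if global_batch % num_workers != 0:
        raise ValueError("global batch must divide evenly across workers")
    return global_batch // num_workers


if __name__ == "__main__":
    # Assumed reference point (not given in the abstract): lr 0.1 at minibatch 256.
    base_lr, base_batch = 0.1, 256
    global_batch, num_gpus = 8192, 256  # figures quoted in the abstract

    target = scaled_lr(base_lr, base_batch, global_batch)
    print("per-GPU batch:", per_worker_batch(global_batch, num_gpus))
    print("target lr after warmup:", target)

    # Hypothetical warmup length of 100 steps, chosen only for illustration.
    for step in (0, 50, 100):
        print(step, warmup_lr(step, 100, base_lr, target))
```

Starting the ramp at the small-minibatch learning rate keeps early updates conservative while the weights are still changing rapidly, which is the optimization difficulty the abstract says warmup is meant to address.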

citation-role summary

background 15 · method 2

citation-polarity summary

background 14 · use method 2 · support 1

representative citing papers

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Masked Autoencoders Are Scalable Vision Learners

cs.CV · 2021-11-11 · accept · novelty 8.0

Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.

Emerging Properties in Self-Supervised Vision Transformers

cs.CV · 2021-04-29 · conditional · novelty 8.0

Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.

TallyTrain: Communication-Efficient Federated Distillation

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

TallyTrain is a hard-label distillation protocol for federated learning that uses argmax transmission and optional sparse merges to match soft-label performance at up to 1000x lower communication cost.

Demystifying Pipeline Parallelism: First Theory for PipeDream

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Introduces Randomized PipeDream abstraction yielding first nonconvex convergence bound for PipeDream and proves delay scales as S squared for S stages.

From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

Derives mini-batch scaling laws for sketched linear regression, with shared approximation terms and protocol-specific variance/fluctuation scalings under power-law spectrum and source condition.

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.

Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.

Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits

math.OC · 2026-05-08 · unverdicted · novelty 7.0

Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.

TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

cs.CR · 2026-05-08 · unverdicted · novelty 7.0

TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.

A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow

physics.flu-dyn · 2026-04-29 · unverdicted · novelty 7.0

A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on cylinders and aerodynamic efficiency gains from 53% to 73% on airfoils.

Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

cs.CR · 2026-04-09 · unverdicted · novelty 7.0

Creates the BGTD benchmark and mmTraffic architecture to enable explainable multimodal interpretation of encrypted network traffic using LLMs.

Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation

cs.IR · 2026-04-04 · unverdicted · novelty 7.0

FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

cs.DC · 2026-04-02 · unverdicted · novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

FedBCD: Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning

cs.LG · 2026-03-05 · unverdicted · novelty 7.0

FedBCGD reduces communication in federated learning by a factor of 1/N through block-wise parameter updates with accelerated convergence guarantees.

Efficient GPU-Accelerated Training of a Neuroevolution Potential with Analytical Gradients

cond-mat.dis-nn · 2025-07-01 · conditional · novelty 7.0

GNEP trains neuroevolution potentials with analytical gradients and Adam optimizer, cutting fitting time by orders of magnitude for Sb-Te systems while matching DFT accuracy on equation of state and radial distribution functions.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

Mini-batch Estimation for Deep Cox Models: Statistical Foundations and Practical Guidance

stat.ML · 2024-08-05 · unverdicted · novelty 7.0

Mini-batch SGD optimizes a different objective than full partial likelihood in Cox models, but the resulting mb-MPLE is still consistent with optimal rates for neural nets and asymptotic normality for linear models.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Segment Anything

cs.CV · 2023-04-05 · unverdicted · novelty 7.0

A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

Scalable Diffusion Models with Transformers

cs.CV · 2022-12-19 · unverdicted · novelty 7.0

DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning

cs.CV · 2021-05-11 · accept · novelty 7.0

VICReg prevents collapse in self-supervised image embeddings via explicit variance, invariance, and covariance regularization and matches state-of-the-art downstream performance.

A Simple Framework for Contrastive Learning of Visual Representations

cs.LG · 2020-02-13 · accept · novelty 7.0

SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.

citing papers explorer

Showing 49 of 99 citing papers.

AI Failures in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges cs.SE · 2025-03-25 · unverdicted · none · ref 41 · internal anchor
Mixed-methods study maps downstream developers' concerns, practices, and challenges with AI failures in PTM-based software.
Autoregressive Video Generation without Vector Quantization cs.CV · 2024-12-18 · unverdicted · none · ref 7 · internal anchor
NOVA reformulates video generation as non-quantized autoregressive frame-by-frame temporal prediction combined with set-by-set spatial prediction, outperforming prior AR video models and some diffusion models in efficiency and quality.
On the Convergence Theory of Pipeline Gradient-based Analog In-memory Training cs.LG · 2024-10-19 · unverdicted · none · ref 4 · internal anchor
Analog-SGD-AP converges with iteration complexity O(ε^{-2} + ε^{-1}) for multi-layer DNNs on AIMC hardware despite analog weight-update imperfections and asynchronous stale gradients.
Learning General Representation of 12-Lead Electrocardiogram with a Joint-Embedding Predictive Architecture cs.LG · 2024-10-11 · unverdicted · none · ref 37 · internal anchor
ECG-JEPA applies a joint-embedding predictive architecture with Cross-Pattern Attention to learn semantic representations from unlabeled 12-lead ECG data and reports state-of-the-art results on diagnostic classification, feature extraction, and segmentation.
Llemma: An Open Language Model For Mathematics cs.CL · 2023-10-16 · unverdicted · none · ref 145 · internal anchor
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
Vision Transformers Need Registers cs.CV · 2023-09-28 · unverdicted · none · ref 105 · internal anchor
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 277 · internal anchor
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
3D Magic Mirror: Clothing Reconstruction from a Single Image via a Causal Perspective cs.CV · 2022-04-27 · unverdicted · none · ref 12 · internal anchor
A causality-aware self-supervised pipeline reconstructs 3D non-rigid clothing from single images by embedding a structural causal map and two EM loops to disentangle camera, shape, texture, and illumination variables.
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 199 · internal anchor
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
YOLOX: Exceeding YOLO Series in 2021 cs.CV · 2021-07-18 · accept · none · ref 8 · internal anchor
YOLOX exceeds prior YOLO models by adopting anchor-free detection, decoupled heads, and SimOTA assignment to reach 50.0% AP on COCO for the large variant.
Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 157 · internal anchor
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
Tent: Fully Test-time Adaptation by Entropy Minimization cs.LG · 2020-06-18 · conditional · none · ref 3 · internal anchor
Test-time entropy minimization adapts models by optimizing for confident predictions, reducing error on corrupted ImageNet-C and enabling source-free domain adaptation.
Adaptive Federated Optimization cs.LG · 2020-02-29 · unverdicted · none · ref 214 · internal anchor
Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes cs.LG · 2019-04-01 · conditional · none · ref 6 · internal anchor
LAMB optimizer trains BERT with batch size 32768, reducing training time to 76 minutes on TPUv3 Pod without performance loss.
A time-series classification framework for individual-level absenteeism prediction under severe class imbalance cs.AI · 2026-06-30 · unverdicted · none · ref 10 · internal anchor
A TSC framework separates historical attendance sequences from future labels and uses LSTM-FCN with BFL or G-Mean loss to achieve approximately 80% balanced accuracy for proactive absenteeism prediction on simulated data.
What Do Students Learn? A Feature-Level Analysis of Dark Knowledge cs.LG · 2026-06-02 · unverdicted · none · ref 24 · internal anchor
Confusion Distillation is a self-distillation method that treats dataset-level confusion patterns as dynamic soft targets, achieving competitive results on ResNet models for CIFAR-100 without a teacher.
A Training-Efficient Transformer-Based Anti-Spoofing Network for Logical Access in ASVspoof 5 cs.SD · 2026-06-02 · unverdicted · none · ref 17 · 2 links · internal anchor
TFPARN applies a Transformer encoder with attention pooling and combined focal-pairwise losses to ASVspoof 5 Track 1, reporting minDCF 0.2430, EER 12.52%, lowest inference memory, and faster training than re-implemented baselines.
SaluNet: Enabling Total Plasticity in Normalization-Free Deep Networks cs.CV · 2026-06-01 · unverdicted · none · ref 16 · internal anchor
SaluNet replaces normalization layers with the SALU activation and reports competitive accuracies on CIFAR-10/100 and ImageNet-1K without normalization.
MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts cs.CV · 2026-05-30 · unverdicted · none · ref 6 · internal anchor
MoEIoU is a mixture-of-experts IoU loss using log-sum-exp aggregation and curriculum weighting that reports consistent gains over prior IoU losses on PASCAL VOC, HRIPCB, and MS COCO with YOLO models.
Orion: Enabling Self-adaptive Memory Management for On-device Online Continual Learning eess.SY · 2026-05-26 · unverdicted · none · ref 31 · internal anchor
Orion is a self-adaptive memory management framework for on-device online continual learning that co-optimizes latency, plasticity, and stability via URGE-based reallocation and prefetching.
Anytime Training with Schedule-Free Spectral Optimization cs.LG · 2026-05-21 · unverdicted · none · ref 68 · internal anchor
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates cs.LG · 2026-05-19 · unverdicted · none · ref 13 · internal anchor
FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.
Information theoretic underpinning of self-supervised learning by clustering cs.LG · 2026-05-12 · unverdicted · none · ref 156 · internal anchor
SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes cs.LG · 2026-05-12 · unverdicted · none · ref 7 · internal anchor
Multi-narrow single-model ensembles outperform wide baselines in low-data image classification by learning diverse features but underperform in data-rich settings where training favors few paths.
Probing Routing-Conditional Calibration in Attention-Residual Transformers cs.CV · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in AR transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction math.OC · 2026-05-09 · unverdicted · none · ref 47 · internal anchor
Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP cs.DC · 2026-05-08 · unverdicted · none · ref 44 · internal anchor
FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring cs.LG · 2026-05-04 · unverdicted · none · ref 17 · internal anchor
A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.
Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics cs.CV · 2026-04-29 · unverdicted · none · ref 3 · internal anchor
Distilled SAM 3 and DINOv3 models deliver near-teacher accuracy in pig tracking (92.29% MOTA, 96.15% IDF1) and behavior classification while achieving 7.77x parameter reduction and fitting on Jetson Orin NX with headroom.
In-context modeling as a retrain-free paradigm for foundation models in computational science cs.CE · 2026-04-25 · unverdicted · none · ref 39 · internal anchor
In-Context Modeling lets one trained model generalize across unseen materials, geometries, and conditions in computational physics by treating measurements as context for inference.
A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning cs.AI · 2026-04-12 · unverdicted · none · ref 3 · internal anchor
A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.
Sampling Parallelism for Fast and Efficient Bayesian Learning cs.LG · 2026-04-06 · unverdicted · none · ref 14 · internal anchor
Sampling parallelism distributes Bayesian sample evaluations across GPUs for near-perfect scaling, lower memory use, and faster convergence via per-GPU data augmentations, outperforming pure data parallelism in diversity.
Improved Mean Flows: On the Challenges of Fastforward Generative Models cs.CV · 2025-12-01 · unverdicted · none · ref 13 · internal anchor
Improved MeanFlow (iMF) reaches 1.72 FID on ImageNet 256x256 with one function evaluation by reformulating the training objective as a regression on instantaneous velocity and treating guidance as flexible conditioning variables.
Self-Supervised Learning for Real-World Object Detection: a Survey cs.CV · 2024-10-09 · unverdicted · none · ref 72 · internal anchor
Survey benchmarks SSL instance discrimination and masked image modeling for object detection, finding instance discrimination suits CNN encoders while MIM suits ViT encoders and custom pre-training, especially for small objects.
A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition eess.AS · 2019-07-10 · unverdicted · none · ref 11 · internal anchor
ADPSGD and Hierarchical-ADPSGD support 3x larger batches than SSGD for ASR, training SWB-2000 to 7.6% WER on SWB and 13.2% on CH in 5.2 hours on 64 V100 GPUs.
Fast Training of Sparse Graph Neural Networks on Dense Hardware stat.ML · 2019-06-27 · unverdicted · none · ref 2 · internal anchor
Techniques enable training the sparse GNN from Allamanis et al. [2018] on dense TPU hardware in 13 minutes versus a full day originally.
Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD cs.LG · 2019-06-26 · unverdicted · none · ref 7 · internal anchor
GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.
Unified Neural Scaling Laws cs.LG · 2026-05-25 · unverdicted · none · ref 8 · internal anchor
Presents a single functional form for neural scaling that unifies multiple scaling dimensions and claims higher extrapolation accuracy than prior forms across diverse tasks and architectures.
Accelerating Redshift-Conditioned Galaxy Image Synthesis with One-step Generative Modeling astro-ph.IM · 2026-05-17 · unverdicted · none · ref 64 · internal anchor
One-step pixel-MeanFlow models recover key galaxy morphology statistics at orders-of-magnitude lower computational cost than standard DDPM sampling while remaining weaker on fine-grained structure.
Accelerated Gradient Descent for Faster Convergence with Minimal Overhead cs.LG · 2026-05-15 · unverdicted · none · ref 34 · internal anchor
CT-AGD accelerates first-order optimization in deep learning by using finite-difference curvature estimates and noise-mitigation heuristics, achieving equivalent accuracy with 33% fewer training epochs and overhead comparable to Adam.
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings cs.CV · 2026-04-21 · unverdicted · none · ref 8 · internal anchor
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism cs.CL · 2024-01-05 · unverdicted · none · ref 160 · internal anchor
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
Massive MIMO Channel Prediction Via Meta-Learning and Deep Denoising: Is a Small Dataset Enough? cs.IT · 2022-10-17 · unverdicted · none · ref 39 · internal anchor
MAML-based predictor with DIP denoising improves massive MIMO channel prediction accuracy with small datasets, especially at low SNR.
FOSNet: An End-to-End Trainable Deep Neural Network for Scene Recognition cs.CV · 2019-07-17 · unverdicted · none · ref 40 · internal anchor
FOSNet fuses object and scene features via CNN and uses scene coherence loss to report SOTA accuracies of 60.14% on Places2 and 90.37% on MIT Indoor67.
Principles and Practice of Deep Representation Learning: or a Mathematical Theory of Memory cs.LG · 2026-06-04 · unverdicted · none · ref 30 · internal anchor
The book presents principles from optimization and information theory to explain deep network architectures and enable new interpretable models.
There Will Be a Scientific Theory of Deep Learning stat.ML · 2026-04-23 · unverdicted · none · ref 266 · internal anchor
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.
Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification cs.CV · 2025-07-31 · unverdicted · none · ref 22 · internal anchor
Hyperparameter tuning on seven lightweight models trained on a 90k-image ImageNet subset yields 1.5-3.5% top-1 accuracy gains, with RepVGG-A2 and MobileNetV3-L achieving sub-5ms latency and over 9800 FPS on GPU.
DistributedEstimator: Distributed Training of Quantum Neural Networks via Circuit Cutting cs.DC · 2026-02-18 · unreviewed · ref 37 · internal anchor
Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches cs.CL · 2025-12-14 · unreviewed · ref 21 · internal anchor
