
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy · 2015 · cs.LG · arXiv 1502.03167

23 Pith papers cite this work. Polarity classification is still indexing.



abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
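The abstract describes normalizing layer inputs per training mini-batch. A minimal sketch of that training-time step follows, in plain Python. The learnable scale `gamma` and shift `beta`, the stabilizing constant `eps`, and the function name `batch_norm` are not named in the abstract; they are labeled here as assumptions drawn from the standard formulation of the transform. The sketch covers only the mini-batch case, not the population statistics used at inference.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over a mini-batch, then scale and shift.

    batch: list of rows, each row a list of feature values.
    gamma, beta: per-feature scale and shift (assumed learnable parameters).
    eps: small constant for numerical stability (assumed value).
    """
    n = len(batch)
    num_features = len(batch[0])

    # Per-feature mean over the mini-batch.
    means = [sum(row[j] for row in batch) / n for j in range(num_features)]

    # Per-feature biased variance over the mini-batch.
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(num_features)
    ]

    # Normalize to roughly zero mean and unit variance, then apply gamma and beta.
    return [
        [
            gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
            for j in range(num_features)
        ]
        for row in batch
    ]


# Two features on very different scales end up on a comparable scale.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` set to 1 and `beta` set to 0, each output column has a mean close to zero and a variance close to one, regardless of the input scale.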


citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

cs.LG · 2017-01-23 · accept · novelty 8.0

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.

Density estimation using Real NVP

cs.LG · 2016-05-27 · accept · novelty 8.0

Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

cs.LG · 2015-11-19 · accept · novelty 8.0

DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.

Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.

Physics-informed, Generative Adversarial Design of Funicular Shells

cs.CE · 2026-04-17 · unverdicted · novelty 7.0

A modified DCGAN with an auxiliary discriminator using the membrane factor generates stable, previously unseen funicular shells optimized for pure compression in three dimensions.

High Fidelity Neural Audio Compression

eess.AS · 2022-10-24 · accept · novelty 7.0

EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.

A Simple Framework for Contrastive Learning of Visual Representations

cs.LG · 2020-02-13 · accept · novelty 7.0

SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.

Progressive Growing of GANs for Improved Quality, Stability, and Variation

cs.NE · 2017-10-27 · accept · novelty 7.0

Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.

The Kinetics Human Action Video Dataset

cs.CV · 2017-05-19 · accept · novelty 7.0

Kinetics is a new video dataset of 400 human actions with over 160000 ten-second clips collected from YouTube, accompanied by baseline action-classification results from neural networks.

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

cs.CV · 2017-04-17 · accept · novelty 7.0

MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.

Continuous control with deep reinforcement learning

cs.LG · 2015-09-09 · accept · novelty 7.0

DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competitively with full-information planning methods.

LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

cs.CV · 2015-06-10 · accept · novelty 7.0

The LSUN dataset, with one million images per category across 30 classes, is constructed via iterative human-in-the-loop labeling with deep learning.

Demystifying Manifold Constraints in LLM Pre-training

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering competitive performance with convergence guarantees.

Revisiting Feature Prediction for Learning Visual Representations from Video

cs.CV · 2024-02-15 · conditional · novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

Rethinking Atrous Convolution for Semantic Image Segmentation

cs.CV · 2017-06-17 · unverdicted · novelty 6.0

DeepLabv3 improves semantic segmentation by capturing multi-scale context with cascaded or parallel atrous convolutions and adding global context to ASPP, achieving better results on PASCAL VOC 2012 without DenseCRF post-processing.

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

cs.LG · 2016-09-15 · unverdicted · novelty 6.0

Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.

Unveiling Hidden Lyman Alpha Emitters in the DESI DR1 Data

astro-ph.GA · 2026-05-12 · unverdicted · novelty 5.0

A CNN detects 19,685 LAEs at z=2-3.5 in DESI DR1 spectra with 95% purity and completeness.

A sound-horizon-free measurement of the Hubble constant from DESI DR2 baryon acoustic oscillations using artificial neural networks

astro-ph.CO · 2026-04-27 · unverdicted · novelty 5.0

Neural network reconstruction of DESI DR2 BAO, SNe Ia, and cosmic chronometer data gives H0 = 71.5 ± 2.2 km s^{-1} Mpc^{-1} without sound horizon input.

Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

cs.LG · 2026-04-22 · unverdicted · novelty 5.0

QDHUAC is a distributional, target-free QD-RL method that enables stable high-UTD training and competitive performance on Brax locomotion tasks using far fewer environment steps than prior approaches.

Enhancing Event Reconstruction in Hyper-Kamiokande with Machine Learning: A ResNet Implementation

hep-ex · 2026-04-15 · conditional · novelty 5.0

ResNet models classify four particle types and regress vertex, direction, and momentum in Hyper-Kamiokande with resolutions matching likelihood methods but at 30,000-50,000x faster inference on GPU.

YOLOv4: Optimal Speed and Accuracy of Object Detection

cs.CV · 2020-04-23 · unverdicted · novelty 5.0

YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.

A Wasserstein GAN-based climate scenario generator for risk management and insurance: the case of soil subsidence

cs.LG · 2026-04-22 · unverdicted · novelty 4.0

A conditional Wasserstein GAN generates plausible future SWI drought trajectories for French insurance risk management under climate change.

RadarCNN: Learning-based Indoor Object Classification from IQ Imaging Radar Data

eess.SP · 2026-04-08 · unverdicted · novelty 4.0

RadarCNN classifies indoor objects from radar IQ data at 97-99% accuracy, holding at ~50% under noise and occlusion.

citing papers explorer


Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

cs.LG · 2017-01-23 · accept · none · ref 23 · internal anchor

Density estimation using Real NVP

cs.LG · 2016-05-27 · accept · none · ref 31 · internal anchor

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

cs.LG · 2015-11-19 · accept · none · ref 8 · internal anchor

Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model

cs.CV · 2026-05-04 · unverdicted · none · ref 43 · internal anchor

Physics-informed, Generative Adversarial Design of Funicular Shells

cs.CE · 2026-04-17 · unverdicted · none · ref 39 · internal anchor

High Fidelity Neural Audio Compression

eess.AS · 2022-10-24 · accept · none · ref 17 · internal anchor

A Simple Framework for Contrastive Learning of Visual Representations

cs.LG · 2020-02-13 · accept · none · ref 29 · internal anchor

Progressive Growing of GANs for Improved Quality, Stability, and Variation

cs.NE · 2017-10-27 · accept · none · ref 23 · internal anchor

The Kinetics Human Action Video Dataset

cs.CV · 2017-05-19 · accept · none · ref 12 · internal anchor

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

cs.CV · 2017-04-17 · accept · none · ref 13 · internal anchor

Continuous control with deep reinforcement learning

cs.LG · 2015-09-09 · accept · none · ref 3 · internal anchor

LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

cs.CV · 2015-06-10 · accept · none · ref 8 · internal anchor

Demystifying Manifold Constraints in LLM Pre-training

cs.LG · 2026-05-06 · unverdicted · none · ref 24 · internal anchor

Revisiting Feature Prediction for Learning Visual Representations from Video

cs.CV · 2024-02-15 · conditional · none · ref 135 · internal anchor

Rethinking Atrous Convolution for Semantic Image Segmentation

cs.CV · 2017-06-17 · unverdicted · none · ref 38 · internal anchor

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

cs.LG · 2016-09-15 · unverdicted · none · ref 6 · internal anchor

Unveiling Hidden Lyman Alpha Emitters in the DESI DR1 Data

astro-ph.GA · 2026-05-12 · unverdicted · none · ref 31 · internal anchor

A sound-horizon-free measurement of the Hubble constant from DESI DR2 baryon acoustic oscillations using artificial neural networks

astro-ph.CO · 2026-04-27 · unverdicted · none · ref 35 · internal anchor

Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

cs.LG · 2026-04-22 · unverdicted · none · ref 25 · internal anchor

Enhancing Event Reconstruction in Hyper-Kamiokande with Machine Learning: A ResNet Implementation

hep-ex · 2026-04-15 · conditional · none · ref 27 · internal anchor

YOLOv4: Optimal Speed and Accuracy of Object Detection

cs.CV · 2020-04-23 · unverdicted · none · ref 32 · internal anchor

A Wasserstein GAN-based climate scenario generator for risk management and insurance: the case of soil subsidence

cs.LG · 2026-04-22 · unverdicted · none · ref 25 · internal anchor

RadarCNN: Learning-based Indoor Object Classification from IQ Imaging Radar Data

eess.SP · 2026-04-08 · unverdicted · none · ref 17 · internal anchor
