Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy · 2015 · cs.LG · arXiv 1502.03167

Mixed citation behavior. Most common role is method (54%).

93 Pith papers citing it

Method 54% of classified citations

abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
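
The per-mini-batch normalization described in the abstract can be sketched as follows. This is a minimal training-time illustration in plain Python, not the authors' implementation: the function name `batch_norm_forward`, the `eps` value, and the toy data are assumptions made for this example. `gamma` and `beta` stand in for the learned per-feature scale and shift. At inference time the paper replaces mini-batch statistics with population estimates, which this sketch omits.

```python
import math


def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    batch: list of examples, each a list of feature values.
    gamma, beta: per-feature scale and shift (learned in a real model).
    eps: small constant for numerical stability (value assumed here).
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        # Biased (divide-by-n) variance over the mini-batch.
        var = sum((x - mean) ** 2 for x in column) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma[j] * (column[i] - mean) * inv_std + beta[j]
    return out


# Toy mini-batch: 3 examples, 2 features on very different scales.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm_forward(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` set to 1 and `beta` set to 0, both features end up with roughly zero mean and unit variance across the mini-batch, even though their original scales differ by a factor of ten.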

citation-role summary

method 7 background 6

citation-polarity summary

use method 7 background 6

claims ledger

method convolution and a 3 × 3 max-pool, followed by four stages of residual blocks with channel depths {64, 128, 256, 512}. An adaptive-average-pooling layer reduces the spatial dimensions, after which a fully connected layer projects to the required output dimension. Kaiming-normal weight initialization [26] and zero-initialized residual-branch batch norms [27] are applied. ResNet-50 networks were also evaluated but yielded inferior reconstruction performance relative to ResNet-152, despite offer
background the resource restrictions (latency, size) for their application. MobileNets primarily focus on optimizing for latency but also yield small networks. Many papers on small networks focus only on size but do not consider speed. MobileNets are built primarily from depthwise separable convolutions initially introduced in [26] and subsequently used in Inception models [13] to reduce the computation in the first few layers. Flattened networks [16] build a network out of fully factorized convolutions and
method powerful class of bijective functions which enable exact and tractable density evaluation and exact and tractable inference. Moreover, the resulting cost function does not to rely on a fixed form reconstruction cost such as square error [38, 47], and generates sharper samples as a result. Also, this flexibility helps us leverage recent advances in batch normalization [31] and residual networks [24, 25] to define a very deep multi-scale architecture with multiple levels of abstraction. 3.1 Change of
method To be concrete, we duplicate several copies of the original last block in ResNet [32] and arrange them in cascade, and also revisit the ASPP module [11] which contains several atrous convolutions in parallel. Note that our cascaded modules are applied directly on the feature maps instead of belief maps. For the proposed modules, we experimentally find it important to train with batch normalization [38]. To further capture global context, we propose to augment ASPP with image-level features, si
method times they are cropped, resized and generally pre-processed in different ways (but, nevertheless, the image classifier could localize the same clip). So even though each clip is from a distinct video there were still duplications. We devised a process for de-duplicating across YouTube links which operated independently for each class. First we computed Inception-V1 [12] feature vectors (taken after last average pooling layer) on 224 × 224 center crops of 25 uniformly sampled frames from each vi
background Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. CVIU, 106(1):59-70, 2007. [7] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015. [8] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint ar

representative citing papers

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

cs.LG · 2017-01-23 · accept · novelty 8.0

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.

Density estimation using Real NVP

cs.LG · 2016-05-27 · accept · novelty 8.0

Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

cs.LG · 2015-11-19 · accept · novelty 8.0

DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.

Determining star formation histories and age-metallicity relations with convolutional neural networks

astro-ph.GA · 2026-05-13 · unverdicted · novelty 7.0

A CNN with attention and shared latent space recovers SFHs and metallicities from spectro-photometric data with ~0.12 dex age and ~0.03 dex metallicity dispersion while running thousands of times faster than full spectral fitting.

Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.

Physics-informed, Generative Adversarial Design of Funicular Shells

cs.CE · 2026-04-17 · unverdicted · novelty 7.0

A modified DCGAN with an auxiliary discriminator using the membrane factor generates stable, previously unseen funicular shells optimized for pure compression in three dimensions.

Deep Learning for CMB Foreground Removal and Beam Deconvolution: A U-Net GAN Approach

astro-ph.IM · 2025-08-29 · unverdicted · novelty 7.0

A U-Net GAN reconstructs CMB T and E maps from Planck-like simulations with foregrounds and systematics, achieving under 1% error outside the Galactic region and demonstrating first-time correction for non-circular beams and asymmetric scans.

High Fidelity Neural Audio Compression

eess.AS · 2022-10-24 · accept · novelty 7.0

EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.

A Simple Framework for Contrastive Learning of Visual Representations

cs.LG · 2020-02-13 · accept · novelty 7.0

SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.

MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks

cs.CV · 2019-07-26 · unverdicted · novelty 7.0

Releases MVB, a multi-view baggage re-identification dataset with 4519 identities and 22660 images, plus a merged Siamese network baseline evaluated on it.

Learning to learn with quantum neural networks via classical neural networks

quant-ph · 2019-07-11 · unverdicted · novelty 7.0

Classical RNNs trained on small instances provide parameter initializations for QAOA and VQE that reduce total optimization iterations and generalize across problem sizes.

IRNet: A General Purpose Deep Residual Regression Framework for Materials Discovery

physics.comp-ph · 2019-07-07 · unverdicted · novelty 7.0

IRNet uses per-layer residual shortcuts in fully connected networks to achieve better prediction accuracy and training convergence than prior ML methods on OQMD and Materials Project datasets for material properties.

Importance Estimation for Neural Network Pruning

cs.LG · 2019-06-25 · unverdicted · novelty 7.0

Taylor-expansion importance scoring enables layer-agnostic pruning of neural networks that outperforms prior methods on ImageNet accuracy-FLOPs trade-offs.

Progressive Growing of GANs for Improved Quality, Stability, and Variation

cs.NE · 2017-10-27 · accept · novelty 7.0

Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.

The Kinetics Human Action Video Dataset

cs.CV · 2017-05-19 · accept · novelty 7.0

Kinetics is a new video dataset of 400 human actions with over 160000 ten-second clips collected from YouTube, accompanied by baseline action-classification results from neural networks.

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

cs.CV · 2017-04-17 · accept · novelty 7.0

MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.

Continuous control with deep reinforcement learning

cs.LG · 2015-09-09 · accept · novelty 7.0

DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competitively with full-information planning methods.

LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

cs.CV · 2015-06-10 · accept · novelty 7.0

LSUN dataset of one million images per category across 30 classes is constructed via iterative human-in-the-loop deep learning labeling.

Higher-order effects in amplitude-assisted polarisation extraction with machine-learning techniques

hep-ph · 2026-07-01 · unverdicted · novelty 6.0

First NLO-QCD amplitude-assisted ML regression for longitudinal-boson production rate in di-boson events at the LHC, benchmarked against random forests.

A Dual-domain Refinement Network with FBP-based Jacobian Learning for Sparse-view Dual-Energy CT Material Decomposition

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

DECT-DRNet combines an FBP-based learnable Jacobian approximation with dual-domain Fourier regularization to improve accuracy of multi-material decomposition from sparse-view dual-energy CT data.

Acceleration of an algebraic multigrid pressure solver using graph neural networks

physics.comp-ph · 2026-06-17 · unverdicted · novelty 6.0

A modified graph convolutional isomorphism network predicts polynomial coefficients for a sparse pseudo-inverse AMG smoother, cutting V-cycles and delivering 4-37% wall-clock speedups while generalizing to larger and unseen meshes.

IV-Net: A neural network for elliptic PDEs with random and highly varying coefficients

math.NA · 2026-05-24 · unverdicted · novelty 6.0

IV-Net is a multigrid-inspired convolutional neural operator that approximates solutions to linear elliptic PDEs with high-contrast coefficients and shows better accuracy than POD and other neural operators on heterogeneous coercive problems.

CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

CogAdapt adapts clinical ECG foundation models to 3-lead wearable signals for cognitive load assessment via a LeadBridge adapter and ProFine progressive fine-tuning, outperforming scratch-trained models with macro-F1 of 0.626 and 0.768 on public datasets under leave-one-subject-out validation.

Q-PhotoNAS: Hybrid Quantum Neural Architecture Search Framework on Photonic Devices

quant-ph · 2026-05-21 · unverdicted · novelty 6.0

Q-PhotoNAS applies genetic algorithm search to jointly optimize classical preprocessing, phase encoding, and photonic circuit structure for hybrid quantum-classical models, reporting 99.44% and 98.78% accuracy on Digits and MNIST with projected photonic QPU inference times.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Learning to learn with quantum neural networks via classical neural networks

quant-ph · 2019-07-11 · unverdicted · none · ref 78 · internal anchor

Classical RNNs trained on small instances provide parameter initializations for QAOA and VQE that reduce total optimization iterations and generalize across problem sizes.

Q-PhotoNAS: Hybrid Quantum Neural Architecture Search Framework on Photonic Devices

quant-ph · 2026-05-21 · unverdicted · none · ref 50 · internal anchor

Q-PhotoNAS applies genetic algorithm search to jointly optimize classical preprocessing, phase encoding, and photonic circuit structure for hybrid quantum-classical models, reporting 99.44% and 98.78% accuracy on Digits and MNIST with projected photonic QPU inference times.

Quantum Algorithm for Distributed Reduction of Entanglements (QADR): A Trainable and Simulation-Efficient QML Framework

quant-ph · 2026-05-31 · unverdicted · none · ref 101 · internal anchor

QADR decomposes n-qubit VQCs into local sub-circuits to reduce memory from O(2^n) to O(n * 2^{2d+1}) and mitigate barren plateaus, scaling to 2000 features on MNIST and wind turbine diagnostics while matching classical models.
