Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Mixed citation behavior. Most common role is method (58%).
abstract
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
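The per-mini-batch normalization described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration of the training-time forward pass only, not the authors' implementation. The function name `batch_norm_train`, the `eps` value, and the toy input are assumptions made for the example. The learned scale `gamma` and shift `beta` come from the paper itself rather than from this abstract, and the running statistics used at inference time are omitted.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for x of shape (batch, features).

    eps is an assumed small constant for numerical stability.
    """
    mean = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature (biased) variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta, mean, var   # learned scale and shift applied

# Toy mini-batch (hypothetical data): 64 examples, 4 features, far from zero mean and unit variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(64, 4))
y, batch_mean, batch_var = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma` set to ones and `beta` to zeros, each output feature has approximately zero mean and unit variance over the batch, whatever the scale of the incoming activations.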
claims ledger
- method convolution and a 3 × 3 max-pool, followed by four stages of residual blocks with channel depths {64, 128, 256, 512}. An adaptive-average-pooling layer reduces the spatial dimensions, after which a fully connected layer projects to the required output dimension. Kaiming-normal weight initialization [26] and zero-initialized residual-branch batch norms [27] are applied. ResNet-50 networks were also evaluated but yielded inferior reconstruction performance relative to ResNet-152, despite offer
- background the resource restrictions (latency, size) for their application. MobileNets primarily focus on optimizing for latency but also yield small networks. Many papers on small networks focus only on size but do not consider speed. MobileNets are built primarily from depthwise separable convolutions initially introduced in [26] and subsequently used in Inception models [13] to reduce the computation in the first few layers. Flattened networks [16] build a network out of fully factorized convolutions and
- method powerful class of bijective functions which enable exact and tractable density evaluation and exact and tractable inference. Moreover, the resulting cost function does not rely on a fixed form reconstruction cost such as square error [38, 47], and generates sharper samples as a result. Also, this flexibility helps us leverage recent advances in batch normalization [31] and residual networks [24, 25] to define a very deep multi-scale architecture with multiple levels of abstraction. 3.1 Change of
- method To be concrete, we duplicate several copies of the original last block in ResNet [32] and arrange them in cascade, and also revisit the ASPP module [11] which contains several atrous convolutions in parallel. Note that our cascaded modules are applied directly on the feature maps instead of belief maps. For the proposed modules, we experimentally find it important to train with batch normalization [38]. To further capture global context, we propose to augment ASPP with image-level features, si
- method times they are cropped, resized and generally pre-processed in different ways (but, nevertheless, the image classifier could localize the same clip). So even though each clip is from a distinct video there were still duplications. We devised a process for de-duplicating across YouTube links which operated independently for each class. First we computed Inception-V1 [12] feature vectors (taken after last average pooling layer) on 224 × 224 center crops of 25 uniformly sampled frames from each vi
- background Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. CVIU, 106(1):59-70, 2007. [7] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015. [8] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint ar
citing papers explorer
-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
-
Density estimation using Real NVP
Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
-
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.
-
Determining star formation histories and age-metallicity relations with convolutional neural networks
A CNN with attention and shared latent space recovers SFHs and metallicities from spectro-photometric data with ~0.12 dex age and ~0.03 dex metallicity dispersion while running thousands of times faster than full spectral fitting.
-
Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model
The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.
-
Physics-informed, Generative Adversarial Design of Funicular Shells
A modified DCGAN with an auxiliary discriminator using the membrane factor generates stable, previously unseen funicular shells optimized for pure compression in three dimensions.
-
Deep Learning for CMB Foreground Removal and Beam Deconvolution: A U-Net GAN Approach
A U-Net GAN reconstructs CMB T and E maps from Planck-like simulations with foregrounds and systematics, achieving under 1% error outside the Galactic region and demonstrating first-time correction for non-circular beams and asymmetric scans.
-
High Fidelity Neural Audio Compression
EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.
-
A Simple Framework for Contrastive Learning of Visual Representations
SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
-
MVB: A Large-Scale Dataset for Baggage Re-Identification and Merged Siamese Networks
Releases MVB, a multi-view baggage re-identification dataset with 4519 identities and 22660 images, plus a merged Siamese network baseline evaluated on it.
-
Learning to learn with quantum neural networks via classical neural networks
Classical RNNs trained on small instances provide parameter initializations for QAOA and VQE that reduce total optimization iterations and generalize across problem sizes.
-
IRNet: A General Purpose Deep Residual Regression Framework for Materials Discovery
IRNet uses per-layer residual shortcuts in fully connected networks to achieve better prediction accuracy and training convergence than prior ML methods on OQMD and Materials Project datasets for material properties.
-
Importance Estimation for Neural Network Pruning
Taylor-expansion importance scoring enables layer-agnostic pruning of neural networks that outperforms prior methods on ImageNet accuracy-FLOPs trade-offs.
-
Progressive Growing of GANs for Improved Quality, Stability, and Variation
Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.
-
The Kinetics Human Action Video Dataset
Kinetics is a new video dataset of 400 human actions with over 160000 ten-second clips collected from YouTube, accompanied by baseline action-classification results from neural networks.
-
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
-
Continuous control with deep reinforcement learning
DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competitively with full-information planning methods.
-
LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop
LSUN dataset of one million images per category across 30 classes is constructed via iterative human-in-the-loop deep learning labeling.
-
A Dual-domain Refinement Network with FBP-based Jacobian Learning for Sparse-view Dual-Energy CT Material Decomposition
DECT-DRNet combines an FBP-based learnable Jacobian approximation with dual-domain Fourier regularization to improve accuracy of multi-material decomposition from sparse-view dual-energy CT data.
-
IV-Net: A neural network for elliptic PDEs with random and highly varying coefficients
IV-Net is a multigrid-inspired convolutional neural operator that approximates solutions to linear elliptic PDEs with high-contrast coefficients and shows better accuracy than POD and other neural operators on heterogeneous coercive problems.
-
CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation
CogAdapt adapts clinical ECG foundation models to 3-lead wearable signals for cognitive load assessment via a LeadBridge adapter and ProFine progressive fine-tuning, outperforming scratch-trained models with macro-F1 of 0.626 and 0.768 on public datasets under leave-one-subject-out validation.
-
Q-PhotoNAS: Hybrid Quantum Neural Architecture Search Framework on Photonic Devices
Q-PhotoNAS applies genetic algorithm search to jointly optimize classical preprocessing, phase encoding, and photonic circuit structure for hybrid quantum-classical models, reporting 99.44% and 98.78% accuracy on Digits and MNIST with projected photonic QPU inference times.
-
A Dual Physics-Informed Kolmogorov-Arnold Neural Network Framework for Continuum Topology Optimization
Dual HRKAN framework (DPIKAN-TO) for topology optimization with one network predicting displacements and another handling sensitivity-based design updates.
-
Demystifying Manifold Constraints in LLM Pre-training
Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering competitive performance with convergence guarantees.
-
Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers
TaperNorm gradually removes internal normalization in pre-norm transformers via learned gates that reach zero, revealing final norm as a scale anchor and enabling up to 1.18x faster KV-cached decoding with small loss increases.
-
TriagerX: Dual Transformers for Bug Triaging Tasks with Content and Interaction Based Rankings
TriagerX combines dual-transformer content rankings with developer interaction history to improve top-k accuracy for developer and component recommendations in bug triaging across five datasets.
-
Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs
Introduces layer-wise learning signals combining knowledge distillation and local errors into Equilibrium Propagation, enabling scalable training of deep VGG-style CRNNs with SOTA results on CIFAR-10 and CIFAR-100.
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
Sharpness-Aware Minimization for Efficiently Improving Generalization
SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.
-
Unsupervised Learning Framework of Interest Point Via Properties Optimization
Unsupervised EM-based joint optimization of interest point detector and descriptor via probability formulations of sparsity, repeatability and discriminability, yielding Property Network that outperforms SOTA on matching benchmarks without retraining.
-
Segmenting Objects in Day and Night: Edge-Conditioned CNN for Thermal Image Semantic Segmentation
EC-CNN uses a gated feature-wise transform to incorporate edge priors for thermal semantic segmentation and introduces the SODA dataset of over 7,000 labeled thermal images.
-
A Deep Learning System for Predicting Size and Fit in Fashion E-Commerce
A deep learning content-collaborative model for size and fit prediction that outperforms state-of-the-art on two public and two proprietary datasets.
-
Interaction-and-Aggregation Network for Person Re-identification
Introduces IA network with SIA and CIA modules to adaptively model spatial and channel feature interdependencies for improved person re-identification on benchmarks.
-
QUOTIENT: Two-Party Secure Neural Network Training and Prediction
QUOTIENT achieves 50X faster WAN training time and 6% higher absolute accuracy for secure two-party DNN training by jointly optimizing a discretized training algorithm with a tailored secure protocol.
-
Adaptive Weighting Depth-variant Deconvolution of Fluorescence Microscopy Images with Convolutional Neural Network
A CNN predicts depth-variant PSFs for patch-wise deconvolution of fluorescence microscopy images, with adaptive weighting to reduce artifacts, claiming 98.2% accuracy and up to 6.6 dB PSNR gain.
-
Graph-based Knowledge Distillation by Multi-head Attention Network
Multi-head attention constructs a graph of dataset relations from the teacher embedding procedure and transfers it to the student via multi-task learning, yielding 7.05% higher CIFAR-100 accuracy than the student alone and 2.46% above prior SOTA.
-
Generalizing from a few environments in safety-critical reinforcement learning
RL agents fail dangerously on unseen environments; ensembles reduce catastrophes in gridworld but not CoinRun, with uncertainty enabling intervention prediction.
-
Rethinking Atrous Convolution for Semantic Image Segmentation
DeepLabv3 improves semantic segmentation by capturing multi-scale context with cascaded or parallel atrous convolutions and adding global context to ASPP, achieving better results on PASCAL VOC 2012 without DenseCRF post-processing.
-
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.
-
Quantum Algorithm for Distributed Reduction of Entanglements (QADR): A Trainable and Simulation-Efficient QML Framework
QADR decomposes n-qubit VQCs into local sub-circuits to reduce memory from O(2^n) to O(n * 2^{2d+1}) and mitigate barren plateaus, scaling to 2000 features on MNIST and wind turbine diagnostics while matching classical models.
-
Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models
Scale vectors in Pre-Norm LLMs aid optimization via preconditioning on linear layers rather than expressivity, and three lightweight modifications to them reduce terminal loss across model scales.
-
Unveiling Hidden Lyman Alpha Emitters in the DESI DR1 Data
A CNN detects 19,685 LAEs at z=2-3.5 in DESI DR1 spectra with 95% purity and completeness.
-
A sound-horizon-free measurement of the Hubble constant from DESI DR2 baryon acoustic oscillations using artificial neural networks
Neural network reconstruction of DESI DR2 BAO, SNe Ia, and cosmic chronometer data gives H0 = 71.5 ± 2.2 km s^{-1} Mpc^{-1} without sound horizon input.
-
Distributional Value Estimation Without Target Networks for Robust Quality-Diversity
QDHUAC is a distributional, target-free QD-RL method that enables stable high-UTD training and competitive performance on Brax locomotion tasks using far fewer environment steps than prior approaches.
-
Enhancing Event Reconstruction in Hyper-Kamiokande with Machine Learning: A ResNet Implementation
ResNet models classify four particle types and regress vertex, direction, and momentum in Hyper-Kamiokande with resolutions matching likelihood methods but at 30,000-50,000x faster inference on GPU.
-
Probabilistic Hysteresis Factor Prediction for Electric Vehicle Batteries with Graphite Anodes Containing Silicon
A data-driven probabilistic approach predicts the hysteresis factor for silicon-graphite anode batteries in electric vehicles, with tests for generalization across vehicle models.
-
DoSReMC: Domain Shift Resilient Mammography Classification using Batch Normalization Adaptation
DoSReMC improves cross-domain generalization in mammography classification by fine-tuning only batch normalization and fully connected layers of pretrained CNNs while preserving convolutional filters, combined with adversarial training.
-
Model-independent calibration of Gamma-Ray Bursts with neural networks
Neural networks calibrate 2D and 3D Dainotti relations on the Platinum GRB sample via ANN-driven MCMC to produce a model-independent Hubble diagram with reduced scatter.
-
YOLOv4: Optimal Speed and Accuracy of Object Detection
YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
-
Product Image Recognition with Guidance Learning and Noisy Supervision
Presents the Product-90 noisy product image dataset and a guidance learning method that combines noisy labels with teacher soft labels to train CNNs, reporting gains over prior methods on Product-90 and three public noisy datasets.