Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
23 Pith papers cite this work. Polarity classification is still indexing.
abstract
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
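A minimal sketch of the per-mini-batch transform the abstract describes: normalize each activation dimension with the mini-batch mean and variance, then apply a learned scale (gamma) and shift (beta). This is an illustrative NumPy sketch, not the paper's reference code; the function name, epsilon value, and toy data are assumptions, and the inference-time switch to population statistics described in the paper is omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Sketch of the Batch Normalizing Transform for one mini-batch.

    x     : (N, D) activations of a layer for a mini-batch of N examples
    gamma : (D,) learned scale
    beta  : (D,) learned shift
    """
    mu = x.mean(axis=0)                     # per-dimension mini-batch mean
    var = x.var(axis=0)                     # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # scale and shift (can recover the identity map)

# toy usage: a shifted, scaled random mini-batch of 32 examples with 4 features
x = np.random.randn(32, 4) * 3.0 + 5.0
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))        # approximately 0 and 1
```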
citation-role summary
roles: method (1)
citation-polarity summary
polarities: use method (1)
citing papers explorer
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
- Density estimation using Real NVP
Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
- Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.
- Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model
The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.
- Physics-informed, Generative Adversarial Design of Funicular Shells
A modified DCGAN with an auxiliary discriminator using the membrane factor generates stable, previously unseen funicular shells optimized for pure compression in three dimensions.
- High Fidelity Neural Audio Compression
EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.
- A Simple Framework for Contrastive Learning of Visual Representations
SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
- Progressive Growing of GANs for Improved Quality, Stability, and Variation
Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.
- The Kinetics Human Action Video Dataset
Kinetics is a new video dataset of 400 human actions with over 160000 ten-second clips collected from YouTube, accompanied by baseline action-classification results from neural networks.
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
- Continuous control with deep reinforcement learning
DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competitively with full-information planning methods.
- LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop
LSUN dataset of one million images per category across 30 classes is constructed via iterative human-in-the-loop deep learning labeling.
- Demystifying Manifold Constraints in LLM Pre-training
Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering competitive performance with convergence guarantees.
- Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
- Rethinking Atrous Convolution for Semantic Image Segmentation
DeepLabv3 improves semantic segmentation by capturing multi-scale context with cascaded or parallel atrous convolutions and adding global context to ASPP, achieving better results on PASCAL VOC 2012 without DenseCRF post-processing.
- On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.
- Unveiling Hidden Lyman Alpha Emitters in the DESI DR1 Data
A CNN detects 19,685 LAEs at z=2-3.5 in DESI DR1 spectra with 95% purity and completeness.
- A sound-horizon-free measurement of the Hubble constant from DESI DR2 baryon acoustic oscillations using artificial neural networks
Neural network reconstruction of DESI DR2 BAO, SNe Ia, and cosmic chronometer data gives H0 = 71.5 ± 2.2 km s^{-1} Mpc^{-1} without sound horizon input.
- Distributional Value Estimation Without Target Networks for Robust Quality-Diversity
QDHUAC is a distributional, target-free QD-RL method that enables stable high-UTD training and competitive performance on Brax locomotion tasks using far fewer environment steps than prior approaches.
- Enhancing Event Reconstruction in Hyper-Kamiokande with Machine Learning: A ResNet Implementation
ResNet models classify four particle types and regress vertex, direction, and momentum in Hyper-Kamiokande with resolutions matching likelihood methods but at 30,000-50,000x faster inference on GPU.
- YOLOv4: Optimal Speed and Accuracy of Object Detection
YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
- A Wasserstein GAN-based climate scenario generator for risk management and insurance: the case of soil subsidence
A conditional Wasserstein GAN generates plausible future SWI drought trajectories for French insurance risk management under climate change.
- RadarCNN: Learning-based Indoor Object Classification from IQ Imaging Radar Data
RadarCNN classifies indoor objects from radar IQ data at 97-99% accuracy, holding at ~50% under noise and occlusion.