A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
hub Mixed citations
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Mixed citation behavior. Most common role is method (54%).
abstract
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each trai
- method convolution and a 3 × 3 max-pool, followed by four stages of residual blocks with channel depths {64, 128, 256, 512}. An adaptive-average-pooling layer reduces the spatial dimensions, after which a fully connected layer projects to the required out- put dimension. Kaiming-normal weight initialization [26] and zero-initialized residual-branch batch norms [27] are applied. ResNet-50 networks were also evaluated but yielded infe- rior reconstruction performance relative to ResNet-152, despite offer
- background the resource restrictions (latency, size) for their application. MobileNets primarily focus on optimizing for latency but also yield small networks. Many papers on small networks focus only on size but do not consider speed. MobileNets are built primarily from depthwise separable convolutions initially introduced in [26] and subsequently used in Inception models [13] to reduce the computation in the first few layers. Flattened networks [16] build a network out of fully factorized convolutions and
- method powerful class of bijective functions which enable exact and tractable density evaluation and exact and tractable inference. Moreover, the resulting cost function does not to rely on a fixed form reconstruction cost such as square error [38, 47], and generates sharper samples as a result. Also, this flexibility helps us leverage recent advances in batch normalization [31] and residual networks [24, 25] to define a very deep multi-scale architecture with multiple levels of abstraction. 3.1 Change of
- method To be concrete, we duplicate several copies of the original last block in ResNet [32] and arrange them in cascade, and also revisit the ASPP module [ 11] which contains several atrous convolutions in parallel. Note that our cascaded mod- ules are applied directly on the feature maps instead of belief maps. For the proposed modules, we experimentally find it important to train with batch normalization [38]. To further capture global context, we propose to augment ASPP with image-level features, si
- method times they are cropped, resized and generally pre-processed in different ways (but, nevertheless, the image classifier could localize the same clip). So even though each clip is from a distinct video there were still duplications. We devised a process for de-duplicating across YouTube links which operated independently for each class. First we computed Inception-V1 [12] feature vectors (taken after last average pooling layer) on 224 × 224 center crops of 25 uni- formly sampled frames from each vi
- background Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. CVIU, 106(1):59-70, 2007. [7] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015. [8] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint ar
co-cited works
representative citing papers
Real NVP uses affine coupling layers to create invertible transformations that support exact density estimation, sampling, and latent inference without approximations.
DCGANs with architectural constraints learn a hierarchy of representations from object parts to scenes in both generator and discriminator across image datasets.
A CNN with attention and shared latent space recovers SFHs and metallicities from spectro-photometric data with ~0.12 dex age and ~0.03 dex metallicity dispersion while running thousands of times faster than full spectral fitting.
The paper introduces the VODA setting for domain adaptation from scratch using vision-language models and presents TS-DRD, which achieves competitive performance on standard benchmarks without source models.
A modified DCGAN with an auxiliary discriminator using the membrane factor generates stable, previously unseen funicular shells optimized for pure compression in three dimensions.
A U-Net GAN reconstructs CMB T and E maps from Planck-like simulations with foregrounds and systematics, achieving under 1% error outside the Galactic region and demonstrating first-time correction for non-circular beams and asymmetric scans.
EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.
SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
Releases MVB, a multi-view baggage re-identification dataset with 4519 identities and 22660 images, plus a merged Siamese network baseline evaluated on it.
Classical RNNs trained on small instances provide parameter initializations for QAOA and VQE that reduce total optimization iterations and generalize across problem sizes.
IRNet uses per-layer residual shortcuts in fully connected networks to achieve better prediction accuracy and training convergence than prior ML methods on OQMD and Materials Project datasets for material properties.
Taylor-expansion importance scoring enables layer-agnostic pruning of neural networks that outperforms prior methods on ImageNet accuracy-FLOPs trade-offs.
Progressive growing stabilizes GAN training to produce high-resolution images of unprecedented quality and achieves a record unsupervised inception score of 8.80 on CIFAR10.
Kinetics is a new video dataset of 400 human actions with over 160000 ten-second clips collected from YouTube, accompanied by baseline action-classification results from neural networks.
MobileNets introduce depthwise separable convolutions plus width and resolution multipliers to produce efficient CNNs that trade off latency and accuracy for mobile and embedded vision applications.
DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competitively with full-information planning methods.
LSUN dataset of one million images per category across 30 classes is constructed via iterative human-in-the-loop deep learning labeling.
First NLO-QCD amplitude-assisted ML regression for longitudinal-boson production rate in di-boson events at the LHC, benchmarked against random forests.
DECT-DRNet combines an FBP-based learnable Jacobian approximation with dual-domain Fourier regularization to improve accuracy of multi-material decomposition from sparse-view dual-energy CT data.
A modified graph convolutional isomorphism network predicts polynomial coefficients for a sparse pseudo-inverse AMG smoother, cutting V-cycles and delivering 4-37% wall-clock speedups while generalizing to larger and unseen meshes.
IV-Net is a multigrid-inspired convolutional neural operator that approximates solutions to linear elliptic PDEs with high-contrast coefficients and shows better accuracy than POD and other neural operators on heterogeneous coercive problems.
CogAdapt adapts clinical ECG foundation models to 3-lead wearable signals for cognitive load assessment via a LeadBridge adapter and ProFine progressive fine-tuning, outperforming scratch-trained models with macro-F1 of 0.626 and 0.768 on public datasets under leave-one-subject-out validation.
Q-PhotoNAS applies genetic algorithm search to jointly optimize classical preprocessing, phase encoding, and photonic circuit structure for hybrid quantum-classical models, reporting 99.44% and 98.78% accuracy on Digits and MNIST with projected photonic QPU inference times.
citing papers explorer
-
Learning to learn with quantum neural networks via classical neural networks
Classical RNNs trained on small instances provide parameter initializations for QAOA and VQE that reduce total optimization iterations and generalize across problem sizes.
-
Q-PhotoNAS: Hybrid Quantum Neural Architecture Search Framework on Photonic Devices
Q-PhotoNAS applies genetic algorithm search to jointly optimize classical preprocessing, phase encoding, and photonic circuit structure for hybrid quantum-classical models, reporting 99.44% and 98.78% accuracy on Digits and MNIST with projected photonic QPU inference times.
-
Quantum Algorithm for Distributed Reduction of Entanglements (QADR): A Trainable and Simulation-Efficient QML Framework
QADR decomposes n-qubit VQCs into local sub-circuits to reduce memory from O(2^n) to O(n * 2^{2d+1}) and mitigate barren plateaus, scaling to 2000 features on MNIST and wind turbine diagnostics while matching classical models.