pith. machine review for the scientific record.

arxiv: 1607.06450 · v1 · submitted 2016-07-21 · 📊 stat.ML · cs.LG

Recognition: unknown

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

classification 📊 stat.ML cs.LG
keywords normalization · training · layer · batch · networks · neural · time · neuron
read the original abstract

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
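The computation described in the abstract can be sketched in a few lines of NumPy (a minimal illustration, not the authors' code; the function and variable names are hypothetical): the mean and variance are taken over the summed inputs to the neurons in a layer for a single training case, the inputs are normalized, and a per-neuron adaptive gain and bias are applied before the non-linearity. For a recurrent network, the same statistics are simply recomputed at each time step.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Layer-normalize the summed inputs `a` (shape: [hidden_units]) of a
    single training case. Statistics are computed over the layer's neurons,
    not over a mini-batch, so training and test computations are identical."""
    mu = a.mean()                       # mean over all summed inputs in the layer
    sigma = a.std()                     # standard deviation over the same units
    a_hat = (a - mu) / (sigma + eps)    # normalize this one case
    return gain * a_hat + bias          # adaptive per-neuron gain and bias

# In a recurrent network the normalization statistics are recomputed
# independently at every time step:
rng = np.random.default_rng(0)
H = 8
gain, bias = np.ones(H), np.zeros(H)
a_t = rng.normal(size=H)                     # summed inputs at one time step
h_t = np.tanh(layer_norm(a_t, gain, bias))   # non-linearity after normalization
```

Because no mini-batch statistics are involved, this works for batch size 1 and avoids the dependence on mini-batch size that the abstract notes for batch normalization.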

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MinMax Recurrent Neural Cascades

    cs.LG 2026-05 conditional novelty 8.0

    MinMax RNCs are recurrent neural models using min-max recurrence that achieve full regular-language expressivity, logarithmic parallel evaluation, uniformly bounded states, and constant state gradients independent of ...

  2. Characterizing the Expressivity of Local Attention in Transformers

    cs.CL 2026-05 unverdicted novelty 8.0

    Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...

  3. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    cs.LG 2023-12 unverdicted novelty 8.0

    Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

  4. The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    cs.CL 2020-12 conditional novelty 8.0

    The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...

  5. Reformer: The Efficient Transformer

    cs.LG 2020-01 accept novelty 8.0

    Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.

  6. Trajectory-Agnostic Asteroid Detection in TESS with Deep Learning

    astro-ph.EP 2026-05 unverdicted novelty 7.0

    A W-Net deep learning model detects asteroids in TESS data independently of trajectory by rotating training image cubes and using adaptive normalization for data scaling.

  7. QAP-Router: Tackling Qubit Routing as Dynamic Quadratic Assignment with Reinforcement Learning

    quant-ph 2026-05 unverdicted novelty 7.0

    QAP-Router models qubit routing as dynamic QAP and applies RL with a solution-aware Transformer to cut CNOT counts by 12-30% versus industry compilers on real circuit benchmarks.

  8. Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

    cs.CL 2026-05 unverdicted novelty 7.0

    Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

  9. OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

    cs.CV 2026-05 conditional novelty 7.0

    OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene g...

  10. Meta-Black-Box Optimization Can Do Search Guidance for Expensive Constrained Multi-Objective Optimization

    cs.NE 2026-05 unverdicted novelty 7.0

    MetaSG-SAEA is a bi-level meta-BBO framework that uses a meta-policy for search guidance via the MM-CCI constraint abstraction and diffusion-based population initialization to outperform baselines on expensive constra...

  11. Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

    cs.CR 2026-05 unverdicted novelty 7.0

    Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utili...

  12. Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

    cs.CV 2026-05 unverdicted novelty 7.0

    Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.

  13. Neural network quantum states in the grand canonical ensemble

    quant-ph 2026-05 unverdicted novelty 7.0

    A new neural quantum state ansatz for bosons in the grand canonical ensemble achieves competitive variational energies in 1D and 2D systems and provides access to one-body reduced density matrices.

  14. QuadNorm: Resolution-Robust Normalization for Neural Operators

    cs.LG 2026-05 unverdicted novelty 7.0

    QuadNorm uses quadrature-based moments instead of uniform averaging in normalization layers, achieving O(h²) consistency across resolutions and better cross-resolution transfer in neural operators.

  15. GPROF-IR: An Improved Single-Channel Infrared Precipitation Retrieval for Merged Satellite Precipitation Products

    physics.ao-ph 2026-05 unverdicted novelty 7.0

    GPROF-IR is a CNN-based retrieval that uses temporal context in geostationary IR observations to produce precipitation estimates with lower error than prior IR methods and climatological consistency with PMW retrieval...

  16. Solving Max-Cut to Global Optimality via Feasibility-Preserving Graph Neural Networks

    cs.LG 2026-05 unverdicted novelty 7.0

    A Max-Cut-specific graph neural network predicts primal- and dual-feasible SDP solutions in linearithmic time, cutting bounding costs in exact branch-and-bound by up to 10.6 times versus a commercial SDP solver while ...

  17. Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

    stat.ML 2026-05 unverdicted novelty 7.0

    Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

  18. SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

    cs.LG 2026-05 unverdicted novelty 7.0

    SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.

  19. How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

    cs.LG 2026-05 unverdicted novelty 7.0

    In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate ...

  20. Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention

    cs.LG 2026-05 unverdicted novelty 7.0

    Multi-head self-attention is modeled as a gradient flow with a non-decreasing energy functional under conditions on score matrices, yielding closed-form clustering thresholds in simplified regimes and monotonic entrop...

  21. PHALAR: Phasors for Learned Musical Audio Representations

    cs.SD 2026-05 unverdicted novelty 7.0

    PHALAR introduces a contrastive audio representation framework with spectral pooling and complex-valued processing that sets new state-of-the-art results in stem retrieval on MoisesDB, Slakh, and ChocoChorales while a...

  22. PHALAR: Phasors for Learned Musical Audio Representations

    cs.SD 2026-05 unverdicted novelty 7.0

    PHALAR introduces a phasor-based contrastive framework with learned spectral pooling and complex heads that enforces pitch-equivariant and phase-equivariant biases, delivering up to 70% relative accuracy gains in stem...

  23. PHALAR: Phasors for Learned Musical Audio Representations

    cs.SD 2026-05 unverdicted novelty 7.0

    PHALAR achieves up to 70% relative accuracy gain in stem retrieval with under half the parameters and 7x faster training by using phasor-based equivariant representations, setting new SOTA on multiple datasets.

  24. iGENE: A Differentiable Flux-Tube Gyrokinetic Code in TensorFlow

    physics.plasm-ph 2026-05 unverdicted novelty 7.0

    A fully differentiable TensorFlow gyrokinetic code allows approximate gradients of nonlinear turbulence quantities to be used for outer-loop tasks such as profile prediction despite stochasticity.

  25. Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

    cs.LG 2026-05 unverdicted novelty 7.0

    MechaRule localizes agonist neurons in LLMs via contrastive hierarchical ablation to ground rule extraction in circuitry, recalling 96.8% of high-effect neurons and reducing task performance when suppressed.

  26. Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks

    cs.NI 2026-05 conditional novelty 7.0

    Graph transformer RL for dynamic RMSA supports up to 13% more traffic than benchmarks on networks up to 143 nodes and 362 links.

  27. Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks

    cs.LG 2026-05 unverdicted novelty 7.0

    EstGraph benchmark evaluates LLMs on estimating properties of very large graphs from random-walk samples that fit in context limits.

  28. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

  29. DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

    cs.SE 2026-04 unverdicted novelty 7.0

    DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.9...

  30. Learning Neural Operator Surrogates for the Black Hole Accretion Code

    astro-ph.HE 2026-04 unverdicted novelty 7.0

    Physics-informed Fourier neural operators recover plasmoid formation in sparse SRRMHD vortex data where data-only models fail, and transformer operators approximate AMR jet evolution, marking first reported uses in th...

  31. Pareto Frontier of Neural Quantum States: Scalable, Affordable, and Accurate Convolutional Backflow for Strongly Correlated Lattice Fermions

    cond-mat.str-el 2026-04 unverdicted novelty 7.0

    SCALE and ACE are new convolutional backflow architectures for Neural Quantum States that deliver O(N^3) scaling with high accuracy and over 40x speedup on Hubbard and t-J models up to 32x32 lattices.

  32. Reference-Augmented Learning for Precise Tracking Policy of Tendon-Driven Continuum Robots

    cs.RO 2026-04 unverdicted novelty 7.0

    Reference-augmented learning with RNN surrogate and stochastic perturbations cuts average position error by 50.9% for 6-DOF tracking on a three-section TDCR compared to non-augmented baselines.

  33. Query-Efficient Quantum Approximate Optimization via Graph-Conditioned Trust Regions

    cs.LG 2026-04 unverdicted novelty 7.0

    A GNN predicts Gaussians over QAOA parameters to create graph-conditioned trust regions that reduce circuit evaluations for MaxCut from 85-343 down to 45 while keeping approximation ratios within 3 points of heuristics.

  34. Bridging the Sensitivity Gap in Precipitation Estimates from Spaceborne Radars using Passive Microwave Observations

    physics.ao-ph 2026-04 conditional novelty 7.0

    A fused PMW retrieval using CloudSat and GPM radar references improves high-latitude precipitation detection skill by 26% and cuts underestimation by over 50% compared to precipitation-radar-only training.

  35. HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA

    cs.CV 2026-04 unverdicted novelty 7.0

    HAC provides a parameter-efficient way to move CLIP into hyperbolic geometry, yielding consistent gains on zero-shot VQA benchmarks without any VQA training data overlap.

  36. A satellite foundation model for improved wealth monitoring

    cs.CY 2026-04 unverdicted novelty 7.0

    Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and genera...

  37. Latent Space Probing for Adult Content Detection in Video Generative Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

  38. To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

    cs.AI 2026-04 conditional novelty 7.0

    Unembedding collapse in transformers prevents distinguishing unseen tokens in symbolic reasoning, but targeted interventions restore generalization.

  39. Linear Image Generation by Synthesizing Exposure Brackets

    cs.CV 2026-04 unverdicted novelty 7.0

    The paper introduces a DiT-based flow-matching model that generates linear images by synthesizing text-conditioned exposure brackets to preserve full dynamic range.

  40. Understanding and Enforcing Weight Disentanglement in Task Arithmetic

    cs.AI 2026-04 unverdicted novelty 7.0

    Task-Feature Specialization explains weight disentanglement in task arithmetic and leads to orthogonality, which OrthoReg enforces to enhance performance of model composition methods.

  41. Machine learning isotope shifts in molecular energy levels

    astro-ph.EP 2026-04 unverdicted novelty 7.0

    Neural network corrects residual errors in isotopologue energy extrapolations for CO2 (MAE reduction in >87% of levels vs Marvel) and transfers patterns to improve CO predictions in >93% of samples.

  42. DEMUX: Boundary-Aware Multi-Scale Traffic Demixing for Multi-Tab Website Fingerprinting

    cs.CR 2026-04 unverdicted novelty 7.0

    DEMUX achieves state-of-the-art multi-tab website fingerprinting accuracy by preserving boundary signals, modeling at multiple scales, and associating dispersed traffic fragments with a new three-component architecture.

  43. Sample Is Feature: Beyond Item-Level, Toward Sample-Level Tokens for Unified Large Recommender Models

    cs.IR 2026-04 unverdicted novelty 7.0

    SIF encodes full historical raw samples as tokens via hierarchical quantization to preserve sample context and unify sequential/non-sequential features in large recommender models.

  44. Data-driven oscillator model for multi-frequency turbulent flows

    physics.flu-dyn 2026-04 unverdicted novelty 7.0

    A data-driven framework extracts oscillators from multi-frequency turbulent flow data via autoencoders and models their dynamics with neural networks to enable long-term forecasting, demonstrated on supersonic cavity flow.

  45. One Scale at a Time: Scale-Autoregressive Modeling for Fluid Flow Distributions

    cs.CE 2026-04 conditional novelty 7.0

    Scale-autoregressive modeling (SAR) samples fluid flow distributions hierarchically from coarse to fine resolutions on meshes, achieving lower distributional error and 2-7x faster runtime than diffusion or flow-matchi...

  46. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  47. Envisioning the Future, One Step at a Time

    cs.CV 2026-04 unverdicted novelty 7.0

    An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.

  48. WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    WOMBET generates reliable prior data with world-model uncertainty penalization and transfers it to target tasks via adaptive offline-online sampling, yielding better sample efficiency than baselines.

  49. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  50. Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset

    cs.CV 2026-04 accept novelty 7.0

    A new balanced Bangla handwritten character dataset paired with a multi-head attention hybrid model using EfficientNetB3, ViT, and Conformer achieves high accuracy and strong generalization.

  51. Unified Vector Floorplan Generation via Markup Representation

    cs.CV 2026-04 unverdicted novelty 7.0

    A single transformer model using a new markup representation generates functional floorplans from diverse conditions and outperforms prior task-specific methods on the RPLAN dataset.

  52. Deformation-based In-Context Learning for Point Cloud Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    DeformPIC deforms query point clouds under prompt guidance for in-context learning, outperforming prior methods with lower Chamfer Distance on reconstruction, denoising, and registration tasks.

  53. Stochastic Policy Gradient Methods in the Uncertain Volatility Model

    q-fin.CP 2026-04 unverdicted novelty 7.0

    A neural-network actor-critic policy gradient algorithm with squashed Gaussian C-vine policies solves high-dimensional robust pricing problems in the uncertain volatility model and outperforms existing Monte Carlo and...

  54. Efficient Memory Management for Large Language Model Serving with PagedAttention

    cs.LG 2023-09 conditional novelty 7.0

    PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

  55. Segment Anything

    cs.CV 2023-04 unverdicted novelty 7.0

    A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

  56. High Fidelity Neural Audio Compression

    eess.AS 2022-10 accept novelty 7.0

    EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same ...

  57. DreamFusion: Text-to-3D using 2D Diffusion

    cs.CV 2022-09 accept novelty 7.0

    Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.

  58. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  59. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    cs.CV 2022-05 accept novelty 7.0

    Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

  60. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.