arxiv: 1910.07467 · v1 · pith:QEET3LADnew · submitted 2019-10-16 · 💻 cs.LG · cs.CL · stat.ML

Root Mean Square Layer Normalization

Biao Zhang, Rico Sennrich

Pith reviewed 2026-05-17 18:35 UTC · model grok-4.3

classification: cs.LG · cs.CL · stat.ML

keywords: rmsnorm, layernorm, layer, inputs, mean, normalization, root, square


The pith

RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Layer normalization stabilizes deep network training by re-centering the summed inputs of each layer around zero and re-scaling them to unit variance. The re-centering step requires computing the mean of those inputs for every layer, which adds overhead, especially in recurrent models. The authors test whether this step can be dropped. RMSNorm instead divides each summed input by the root mean square (RMS) of the layer's summed inputs. This keeps the re-scaling invariance and an implicit learning-rate adaptation effect but removes the mean calculation. A partial variant, pRMSNorm, estimates the RMS from only a fraction of the inputs for further savings. Experiments on machine translation, reading comprehension, image-caption retrieval, and image classification with RNNs, CNNs, and Transformers show that models trained with RMSNorm reach accuracy comparable to their LayerNorm counterparts while running faster, with reported running-time reductions of 7% to 64% depending on the model. The authors release their code so that others can verify and adopt the method.
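The difference between the two operators can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released implementation: the function names, the `gain` and `bias` vectors, and the small `eps` constant are assumptions introduced here for the example.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-8):
    """LayerNorm over the last axis: re-center, then re-scale."""
    mean = a.mean(axis=-1, keepdims=True)
    var = ((a - mean) ** 2).mean(axis=-1, keepdims=True)
    return (a - mean) / np.sqrt(var + eps) * gain + bias

def rms_norm(a, gain, eps=1e-8):
    """RMSNorm over the last axis: re-scale only, no mean subtraction."""
    rms = np.sqrt((a ** 2).mean(axis=-1, keepdims=True) + eps)
    return a / rms * gain

# Hypothetical batch of summed inputs with a non-zero mean.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.5, scale=2.0, size=(4, 16))
gain = np.ones(16)   # assumed learnable gain, fixed to 1 here
bias = np.zeros(16)  # assumed learnable bias, fixed to 0 here

y_ln = layer_norm(a, gain, bias)
y_rms = rms_norm(a, gain)

# Each RMSNorm output row has a root mean square close to 1.
print(np.sqrt((y_rms ** 2).mean(axis=-1)))
# Each LayerNorm output row additionally has a mean close to 0.
print(y_ln.mean(axis=-1))
```

RMSNorm needs one reduction per row (the mean of squares), whereas this LayerNorm sketch needs two (the mean, then the variance), which is where the saving described above would come from.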

Core claim

Extensive experiments on several tasks using diverse network architectures show that RMSNorm achieves comparable performance against LayerNorm but reduces the running time by 7%~64% on different models.

Load-bearing premise

Re-centering invariance in LayerNorm is dispensable for the stabilization and convergence benefits the method provides.
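What the premise gives up can be checked numerically. The sketch below, which uses hypothetical helper names and omits gain and bias terms for brevity, compares how each operator responds to re-scaling and to shifting its input.

```python
import numpy as np

def rms_norm(a):
    """Divide by the root mean square of the last axis (no gain, for brevity)."""
    return a / np.sqrt((a ** 2).mean(axis=-1, keepdims=True))

def layer_norm(a):
    """Subtract the mean, then divide by the standard deviation (no gain/bias)."""
    mean = a.mean(axis=-1, keepdims=True)
    std = np.sqrt(((a - mean) ** 2).mean(axis=-1, keepdims=True))
    return (a - mean) / std

rng = np.random.default_rng(1)
a = rng.normal(size=(3, 8))

scaled = 5.0 * a    # re-scaling by a positive constant
shifted = a + 3.0   # re-centering: add the same offset to every input

rms_scale_invariant = np.allclose(rms_norm(scaled), rms_norm(a))
rms_shift_invariant = np.allclose(rms_norm(shifted), rms_norm(a))
ln_scale_invariant = np.allclose(layer_norm(scaled), layer_norm(a))
ln_shift_invariant = np.allclose(layer_norm(shifted), layer_norm(a))

print(rms_scale_invariant, rms_shift_invariant)  # True False
print(ln_scale_invariant, ln_shift_invariant)    # True True
```

RMSNorm keeps the re-scaling invariance but its output changes when the input is shifted; the paper's hypothesis is that losing this second property does not hurt training in practice.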

Original abstract

Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight matrix. However, the computational overhead introduced by LayerNorm makes these improvements expensive and significantly slows the underlying network, e.g. RNN in particular. In this paper, we hypothesize that re-centering invariance in LayerNorm is dispensable and propose root mean square layer normalization, or RMSNorm. RMSNorm regularizes the summed inputs to a neuron in one layer according to root mean square (RMS), giving the model re-scaling invariance property and implicit learning rate adaptation ability. RMSNorm is computationally simpler and thus more efficient than LayerNorm. We also present partial RMSNorm, or pRMSNorm where the RMS is estimated from p% of the summed inputs without breaking the above properties. Extensive experiments on several tasks using diverse network architectures show that RMSNorm achieves comparable performance against LayerNorm but reduces the running time by 7%~64% on different models. Source code is available at https://github.com/bzhangGo/rmsnorm.
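A rough sketch of the partial variant follows. The abstract only states that the RMS is estimated from p% of the summed inputs; taking the first `k = ceil(n * p)` entries, the function name, and the `eps` constant are assumptions made here for illustration.

```python
import math
import numpy as np

def partial_rms_norm(a, p=0.25, eps=1e-8):
    """Normalize all inputs by an RMS estimated from a fraction p of them.

    Assumption: the estimate uses the first k = ceil(n * p) entries of the
    last axis; other selection schemes are conceivable.
    """
    n = a.shape[-1]
    k = max(1, math.ceil(n * p))
    partial = a[..., :k]
    rms = np.sqrt((partial ** 2).mean(axis=-1, keepdims=True) + eps)
    return a / rms

rng = np.random.default_rng(2)
a = rng.normal(size=(2, 64))

full = partial_rms_norm(a, p=1.0)    # p = 1 uses every input
part = partial_rms_norm(a, p=0.25)   # RMS estimated from 16 of 64 inputs

# Re-scaling the input leaves the partial variant's output unchanged too,
# because the estimated RMS scales by the same factor as the input.
still_scale_invariant = np.allclose(partial_rms_norm(10.0 * a, p=0.25), part)
print(still_scale_invariant)
```

With `p=1.0` the function reduces to the full RMS computation, so a single implementation of this kind could cover both cases.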

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RMSNorm simplifies LayerNorm by dropping the mean term and delivers real speedups with comparable performance on the tested models.


The punchline is that RMSNorm matches LayerNorm's performance but runs faster because it skips the centering calculation. This holds across the architectures and tasks covered in the experiments.

The paper introduces RMSNorm as an operator that normalizes by the root mean square of the summed inputs rather than by the standard deviation after centering. It also defines partial RMSNorm (pRMSNorm), which estimates the RMS from a subset of the inputs. Both are new relative to the original LayerNorm work.

What the authors do well is test the method on diverse setups: machine translation, reading comprehension, image-caption retrieval, and image classification, covering RNNs, CNNs, and Transformers. The running-time reductions are reported clearly, and the code is public, which strengthens the contribution. The results look reliable. The direct comparisons show no major accuracy loss, supporting the idea that re-centering invariance is not essential for the stabilization benefits.

The soft spots are minor. The paper could offer more analysis of why the method works and of its potential limitations in other domains, but the empirical focus is appropriate for a practical paper of this kind. There are no issues with circularity or data reuse.

This paper is for researchers and engineers working on efficient neural network training, particularly those using LayerNorm in RNNs or Transformers. A reader interested in small changes that reduce compute at little risk will get value from it. I recommend engaging with the work and sending it to peer review; the contribution is solid enough to warrant referee time.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about LayerNorm properties and introduces no new free parameters or invented entities beyond the choice of which inputs to sample for the partial variant.

axioms (1)

domain assumption: Re-centering invariance in LayerNorm is dispensable
This premise is invoked to justify dropping the mean subtraction step while retaining the claimed benefits.
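What dropping the mean subtraction changes can be written out explicitly. The notation below is an assumption introduced for this sketch and is not taken from the page: $a_1, \dots, a_n$ are the summed inputs of one layer and $g_i$ is a learned gain (the LayerNorm bias is omitted for brevity).

```latex
% LayerNorm: statistics and normalized output
\[
\mu = \frac{1}{n}\sum_{i=1}^{n} a_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (a_i - \mu)^2}, \qquad
\bar{a}_i^{\mathrm{LN}} = \frac{a_i - \mu}{\sigma}\, g_i
\]

% RMSNorm: a single statistic, no mean subtraction
\[
\mathrm{RMS}(a) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} a_i^2}, \qquad
\bar{a}_i^{\mathrm{RMS}} = \frac{a_i}{\mathrm{RMS}(a)}\, g_i
\]

% Relation between the two statistics
\[
\mathrm{RMS}(a)^2 = \sigma^2 + \mu^2
\]
```

The last identity shows that the two operators coincide exactly when $\mu = 0$, so the assumption amounts to saying that training does not depend on forcing the mean of the summed inputs to zero.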



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · J_symmetric · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: "RMSNorm regularizes the summed inputs to a neuron in one layer according to root mean square (RMS), giving the model re-scaling invariance property"

Foundation.DimensionForcing · alexander_duality_circle_linking · unclear
UNCLEAR: Relation between the paper passage and the cited Recognition theorem.
Passage: "we hypothesize that re-centering invariance in LayerNorm is dispensable"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
cs.LG 2026-05 unverdicted novelty 6.0
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
cs.LG 2026-05 unverdicted novelty 6.0
LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.

Velox: Learning Representations of 4D Geometry and Appearance
cs.CV 2026-05 unverdicted novelty 6.0
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

Demystifying Manifold Constraints in LLM Pre-training
cs.LG 2026-05 unverdicted novelty 6.0
Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...

State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
cs.LG 2026-04 unverdicted novelty 6.0
SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...

Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
cs.CL 2026-04 unverdicted novelty 6.0
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...

Parcae: Scaling Laws For Stable Looped Language Models
cs.LG 2026-04 unverdicted novelty 6.0
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

HealDA: Highlighting the importance of initial errors in end-to-end AI weather forecasts
physics.ao-ph 2026-01 conditional novelty 6.0
HealDA supplies ML-based initial conditions for AI weather models that produce forecasts trailing ERA5-initialized runs by less than one day of effective lead time, with the skill gap arising mainly from initial error size.

PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction
cs.PF 2026-01 unverdicted novelty 6.0
PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
cs.RO 2025-09 unverdicted novelty 6.0
F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

ST-MoE: Designing Stable and Transferable Sparse Expert Models
cs.CL 2022-02 unverdicted novelty 6.0
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
physics.ins-det 2026-05 unverdicted novelty 5.0
CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters
cs.LG 2026-05 unverdicted novelty 5.0
Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.

Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model
cs.LG 2026-04 unverdicted novelty 5.0
Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.

NVIDIA Nemotron 3: Efficient and Open Intelligence
cs.CL 2025-12 unverdicted novelty 5.0
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

Hierarchical Reasoning Model
cs.AI 2025-06 unverdicted novelty 5.0
HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples ...

Gemma: Open Models Based on Gemini Research and Technology
cs.CL 2024-03 accept novelty 4.0
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

Gemma 2: Improving Open Language Models at a Practical Size
cs.CL 2024-07 conditional novelty 3.0
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
