pith. machine review for the scientific record. sign in

arxiv: 1910.07467 · v1 · pith:QEET3LADnew · submitted 2019-10-16 · 💻 cs.LG · cs.CL· stat.ML

Root Mean Square Layer Normalization

Pith reviewed 2026-05-17 18:35 UTC · model grok-4.3

classification 💻 cs.LG cs.CLstat.ML
keywords rmsnormlayernormlayerinputsmeannormalizationrootsquare
0
0 comments X

The pith

RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Layer normalization stabilizes deep network training by recentering each layer's inputs around zero and rescaling them to unit variance. The recentering step requires computing a mean for every neuron, which adds overhead especially in recurrent models. The authors test the idea that this recentering step can be dropped. RMSNorm instead divides each summed input by its root mean square value. This keeps the rescaling benefit and an implicit learning-rate adaptation effect but removes the mean calculation. A partial version estimates the RMS from only a fraction of the inputs for further savings. Experiments on machine translation, language modeling, and other tasks with RNNs, CNNs, and transformers show that models trained with RMSNorm reach similar accuracy to LayerNorm versions yet run faster, with reported speedups between 7 and 64 percent depending on architecture. The authors release code so others can verify and adopt the method.

Core claim

Extensive experiments on several tasks using diverse network architectures show that RMSNorm achieves comparable performance against LayerNorm but reduces the running time by 7%~64% on different models.

Load-bearing premise

Re-centering invariance in LayerNorm is dispensable for the stabilization and convergence benefits the method provides.

read the original abstract

Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight matrix. However, the computational overhead introduced by LayerNorm makes these improvements expensive and significantly slows the underlying network, e.g. RNN in particular. In this paper, we hypothesize that re-centering invariance in LayerNorm is dispensable and propose root mean square layer normalization, or RMSNorm. RMSNorm regularizes the summed inputs to a neuron in one layer according to root mean square (RMS), giving the model re-scaling invariance property and implicit learning rate adaptation ability. RMSNorm is computationally simpler and thus more efficient than LayerNorm. We also present partial RMSNorm, or pRMSNorm where the RMS is estimated from p% of the summed inputs without breaking the above properties. Extensive experiments on several tasks using diverse network architectures show that RMSNorm achieves comparable performance against LayerNorm but reduces the running time by 7%~64% on different models. Source code is available at https://github.com/bzhangGo/rmsnorm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on one domain assumption about LayerNorm properties and introduces no new free parameters or invented entities beyond the choice of which inputs to sample for the partial variant.

axioms (1)
  • domain assumption Re-centering invariance in LayerNorm is dispensable
    This premise is invoked to justify dropping the mean subtraction step while retaining the claimed benefits.

pith-pipeline@v0.9.0 · 5493 in / 1233 out tokens · 118523 ms · 2026-05-17T18:35:04.596165+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation J_symmetric echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    RMSNorm regularizes the summed inputs to a neuron in one layer according to root mean square (RMS), giving the model re-scaling invariance property

  • Foundation.DimensionForcing alexander_duality_circle_linking unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    we hypothesize that re-centering invariance in LayerNorm is dispensable

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

  2. LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.

  3. Velox: Learning Representations of 4D Geometry and Appearance

    cs.CV 2026-05 unverdicted novelty 6.0

    Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

  4. Demystifying Manifold Constraints in LLM Pre-training

    cs.LG 2026-05 unverdicted novelty 6.0

    Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...

  5. State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...

  6. Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

    cs.CL 2026-04 unverdicted novelty 6.0

    Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...

  7. Parcae: Scaling Laws For Stable Looped Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...

  8. HealDA: Highlighting the importance of initial errors in end-to-end AI weather forecasts

    physics.ao-ph 2026-01 conditional novelty 6.0

    HealDA supplies ML-based initial conditions for AI weather models that produce forecasts trailing ERA5-initialized runs by less than one day of effective lead time, with the skill gap arising mainly from initial error size.

  9. PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction

    cs.PF 2026-01 unverdicted novelty 6.0

    PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.

  10. F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    cs.RO 2025-09 unverdicted novelty 6.0

    F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

  11. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  12. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  13. mHC-SSM: Manifold-Constrained Hyper-Connections for State Space Language Models with Stream-Specialized Adapters

    cs.LG 2026-05 unverdicted novelty 5.0

    Manifold-constrained multi-stream mixing plus per-stream adapters improves SSM language model validation loss from 6.3507 to 6.1353 and perplexity from 572.91 to 461.88 on WikiText-2.

  14. Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model

    cs.LG 2026-04 unverdicted novelty 5.0

    Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.

  15. NVIDIA Nemotron 3: Efficient and Open Intelligence

    cs.CL 2025-12 unverdicted novelty 5.0

    NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.

  16. Hierarchical Reasoning Model

    cs.AI 2025-06 unverdicted novelty 5.0

    HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples ...

  17. Gemma: Open Models Based on Gemini Research and Technology

    cs.CL 2024-03 accept novelty 4.0

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  18. Gemma 2: Improving Open Language Models at a Practical Size

    cs.CL 2024-07 conditional novelty 3.0

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 17 Pith papers · 16 internal anchors

  1. [1]

    Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

    Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-...

  2. [2]

    Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks

    Devansh Arpit, Yingbo Zhou, Bhargava U Kota, and Venu Govindaraju. Normalization propa- gation: A parametric technique for removing internal covariate shift in deep networks. arXiv preprint arXiv:1603.01431, 2016

  3. [3]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  4. [4]

    Neural Machine Translation by Jointly Learning to Align and Translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv e-prints, abs/1409.0473, September 2014

  5. [5]

    Understanding batch normalization

    Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding batch normalization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa- Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31 , pages 7694–7705. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/ 7996-understanding-batch...

  6. [6]

    The best of both worlds: Combining recent advances in neural machine translation

    Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Mee...

  7. [7]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder- decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014

  8. [8]

    Recurrent Batch Normalization

    Tim Cooijmans, Nicolas Ballas, César Laurent, Ça˘glar Gülçehre, and Aaron Courville. Recur- rent batch normalization. arXiv preprint arXiv:1603.09025, 2016

  9. [9]

    Teaching machines to read and comprehend

    Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701, 2015

  10. [10]

    Norm matters: efficient and accurate normalization schemes in deep networks

    Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. arXiv preprint arXiv:1803.01814, 2018

  11. [11]

    Batch renormalization: Towards reducing minibatch dependence in batch- normalized models

    Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch- normalized models. In Advances in Neural Information Processing Systems , pages 1945–1953, 2017

  12. [12]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - V olume 37, ICML’15, pages 448–456, 2015

  13. [13]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  14. [14]

    Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014

  15. [15]

    Krizhevsky and G

    A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto , 2009. 10

  16. [16]

    Batch normalized recurrent neural networks

    César Laurent, Gabriel Pereyra, Philémon Brakel, Ying Zhang, and Yoshua Bengio. Batch normalized recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages 2657–2661. IEEE, 2016

  17. [17]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

  18. [18]

    Improving Lexical Choice in Neural Machine Translation

    Toan Q Nguyen and David Chiang. Improving lexical choice in neural machine translation. arXiv preprint arXiv:1710.01329, 2017

  19. [19]

    Image Transformer

    Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, and Alexander Ku. Image transformer. arXiv preprint arXiv:1802.05751, 2018

  20. [20]

    Automatic differentiation in pytorch

    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017

  21. [21]

    A Call for Clarity in Reporting BLEU Scores

    Matt Post. A call for clarity in reporting bleu scores. arXiv preprint arXiv:1804.08771, 2018

  22. [22]

    Weight normalization: A simple reparameterization to accelerate training of deep neural networks

    Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems 29, pages 901–909. 2016

  23. [23]

    How does batch normalization help optimization? In Advances in Neural Information Processing Systems 31 , pages 2488–2498

    Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In Advances in Neural Information Processing Systems 31 , pages 2488–2498. 2018

  24. [24]

    Neural Machine Translation of Rare Words with Subword Units

    Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015

  25. [25]

    Nematus: a Toolkit for Neural Machine Translation

    Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association ...

  26. [26]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 2556–2565, 2018

  27. [27]

    Improving predictive inference under covariate shift by weighting the log-likelihood function

    Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference , 90(2):227–244, 2000

  28. [28]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

  29. [29]

    Theano: A Python framework for fast computation of mathematical expressions

    Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, May 2016

  30. [30]

    Lempitsky

    Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, 2016

  31. [31]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Informa- tion Processing Systems 30, pages 5998–6008. 2017

  32. [32]

    Order-Embeddings of Images and Language

    Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015

  33. [33]

    L1-norm batch normalization for efficient training of deep neural networks

    Shuang Wu, Guoqi Li, Lei Deng, Liu Liu, Dong Wu, Yuan Xie, and Luping Shi. L1-norm batch normalization for efficient training of deep neural networks. IEEE transactions on neural networks and learning systems , 2018. 11

  34. [34]

    Group normalization

    Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018

  35. [35]

    A Lightweight Recurrent Network for Sequence Modeling

    Biao Zhang and Rico Sennrich. A lightweight recurrent network for sequence modeling. arXiv preprint arXiv:1905.13324, 2019

  36. [36]

    Dauphin, and Tengyu Ma

    Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Residual learning without normalization via better initialization. In International Conference on Learning Representations , 2019

  37. [37]

    Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese

    Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu. Syllable-based sequence-to-sequence speech recognition with the transformer in mandarin chinese. arXiv preprint arXiv:1804.10752, 2018. 12 A Appendix A.1 Machine Translation We experiment on the WMT14 English-German translation task, where the training corpus consists of 4.5M aligned sentence pairs. We use ne...