pith. machine review for the scientific record.

arxiv: 2605.14521 · v1 · submitted 2026-05-14 · 💻 cs.LG

Recognition: 1 theorem link · Lean Theorem

Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 02:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords layer normalization · RMSNorm · inference acceleration · foldable LN · centering operation · deep neural networks · exact equivalence

The pith

Many layer normalizations in standard networks can be folded exactly into upstream layers, allowing precise replacement by faster RMSNorm at inference time with no change in predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a way to test whether any given layer normalization can be replaced by RMSNorm without changing the model's computed function. The test works by checking if the centering step inside LN can be absorbed into the preceding linear layers through a zero-mean constraint on their outputs. When this condition holds, the LN is called foldable. A graph algorithm detects foldable cases across entire networks. In practice this covers many layers in common architectures, so the centering arithmetic can be dropped at inference while the numerical outputs stay identical.
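
To make the single-layer case concrete, here is a minimal numeric sketch of the folding step. It assumes the standard LayerNorm and RMSNorm definitions and ordinary PyTorch tensors; `center_columns` is an illustrative name for the CBWC step, not the paper's released code.

```python
import torch

torch.manual_seed(0)

def layer_norm(z, gamma, beta, eps=1e-6):
    # Standard LN: per-sample centering and scaling over the feature dim.
    mu = z.mean(dim=-1, keepdim=True)
    var = z.var(dim=-1, unbiased=False, keepdim=True)
    return (z - mu) / torch.sqrt(var + eps) * gamma + beta

def rms_norm(z, gamma, beta, eps=1e-6):
    # RMSNorm: scaling only, no centering. The bias term is kept here only to
    # carry LN's beta; vanilla RMSNorm omits it.
    rms = torch.sqrt(z.pow(2).mean(dim=-1, keepdim=True) + eps)
    return z / rms * gamma + beta

def center_columns(W, b):
    # CBWC as described in the abstract: subtract the per-column mean of W and
    # the mean of b, so every output of the linear layer has zero mean (CCC).
    W_star = W - W.mean(dim=0, keepdim=True)   # each column of W_star sums to 0
    b_star = b - b.mean()
    return W_star, b_star

d_in, d_out, batch = 32, 64, 8
W, b = torch.randn(d_out, d_in), torch.randn(d_out)
gamma, beta = torch.randn(d_out), torch.randn(d_out)
x = torch.randn(batch, d_in)

z = x @ W.T + b                          # original pre-normalization activations
W_star, b_star = center_columns(W, b)
z_star = x @ W_star.T + b_star           # zero-mean by construction

out_ln = layer_norm(z, gamma, beta)      # original LN path
out_rms = rms_norm(z_star, gamma, beta)  # folded RMSNorm path
print(torch.allclose(out_ln, out_rms, atol=1e-5))  # True: identical outputs
```

Because LN discards the per-sample mean anyway, shifting that mean into the weights changes nothing downstream while letting the cheaper RMSNorm do the remaining work.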

Core claim

An LN is foldable when its centering operation can be moved into upstream linear layers by enforcing the column-centered constraint on activations and column-based weight centering on the weight matrices, leaving the overall input-output mapping unchanged. A graph-based detection procedure identifies all such foldable LNs in an arbitrary DNN. Once identified, each foldable LN converts exactly to an RMSNorm at inference, removing the per-sample mean subtraction while preserving exact equivalence to the original model.
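
The detection step can be pictured as a pass over the computation graph that propagates a "can be made zero-mean" property. The rules and routine below are an illustrative reconstruction of that idea, not the paper's Algorithm 1; they conservatively treat nonlinearities and embeddings as not zero-mean.

```python
from collections import defaultdict

# Illustrative rules: general linear layers can be made zero-mean via CBWC; LN
# outputs are zero-mean by definition; scalar ops and residual additions keep
# the property; everything else (embeddings, nonlinearities, ...) does not.
ZERO_MEAN_SOURCES = {"linear", "layernorm"}
ZERO_MEAN_PRESERVING = {"scalar", "residual_add"}

def foldable_layernorms(layers, edges):
    """layers: dict name -> kind, in topological order; edges: (src, dst) pairs of a DAG."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    zero_mean = {}
    foldable = []
    for name, kind in layers.items():
        if kind in ZERO_MEAN_SOURCES:
            zero_mean[name] = True
        elif kind in ZERO_MEAN_PRESERVING:
            zero_mean[name] = all(zero_mean.get(p, False) for p in preds[name])
        else:
            zero_mean[name] = False
        # An LN is foldable when every incoming path already carries zero-mean inputs.
        if kind == "layernorm" and preds[name] and all(zero_mean.get(p, False) for p in preds[name]):
            foldable.append(name)
    return foldable

# Toy graph in the spirit of Figure 3: two linear layers feed a residual add, then an LN.
layers = {"A": "linear", "B": "linear", "add": "residual_add", "D": "layernorm"}
edges = [("A", "add"), ("B", "add"), ("add", "D")]
print(foldable_layernorms(layers, edges))  # ['D']
```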

What carries the argument

Foldable LN, identified by whether the centering step can be absorbed via column-centered constraint and column-based weight centering on upstream linear layers.

Load-bearing premise

Enforcing zero-mean outputs on upstream linear layers through column-centered constraints and weight centering leaves the overall model function unchanged.
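
For a single linear layer feeding an LN, that premise can be checked directly. The derivation below is a sketch using the standard LN and RMSNorm formulas, with CCC and CBWC written in an assumed notation that may not match the paper's exact statements; z = Wx + b and μ(z) = (1/n)·1ᵀz is the per-sample mean over the n normalized features.

```latex
\begin{align*}
  W^{*} &= W - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top} W,
  \qquad b^{*} = b - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top} b
  && \text{(CBWC)} \\
  z^{*} &= W^{*}x + b^{*} = z - \mu(z)\,\mathbf{1},
  \qquad \mu(z^{*}) = 0
  && \text{(CCC holds)} \\
  \mathrm{LN}(z)
  &= \frac{z - \mu(z)\,\mathbf{1}}{\sqrt{\tfrac{1}{n}\lVert z - \mu(z)\,\mathbf{1} \rVert^{2} + \epsilon}} \odot \gamma + \beta
   = \frac{z^{*}}{\sqrt{\tfrac{1}{n}\lVert z^{*} \rVert^{2} + \epsilon}} \odot \gamma + \beta
   = \mathrm{RMSNorm}(z^{*})
\end{align*}
```

The input-output mapping is unchanged because LN discards exactly the quantity that CBWC moves into the weights.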

What would settle it

Run the conversion on a network where the detection algorithm reports non-foldable LNs and measure whether the numerical outputs or task accuracy differ from the original LN model.
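
A minimal version of that experiment, sketched here with PyTorch modules: the foldable case (an LN fed directly by a linear layer, as in a Post-LN block) must match exactly after conversion, while a non-foldable case (a nonlinearity between the linear layer and the LN) should not. `fold` and `RMSNorm` are illustrative helpers, not the paper's released code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class RMSNorm(nn.Module):
    """Minimal RMSNorm carrying the affine parameters of the LN it replaces."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight + self.bias

def fold(linear, ln):
    """Apply CBWC to `linear` in place and return an RMSNorm with ln's parameters."""
    with torch.no_grad():
        linear.weight -= linear.weight.mean(dim=0, keepdim=True)
        linear.bias -= linear.bias.mean()
    rms = RMSNorm(ln.normalized_shape[0], ln.eps)
    rms.weight.data.copy_(ln.weight.data)
    rms.bias.data.copy_(ln.bias.data)
    return rms

x = torch.randn(4, 16)

# Foldable case: LN directly after a linear layer -> conversion must be exact.
lin, ln = nn.Linear(16, 16), nn.LayerNorm(16)
ref = ln(lin(x))                     # original model output
rms = fold(lin, ln)
print(torch.allclose(ref, rms(lin(x)), atol=1e-5))                  # expected: True

# Non-foldable case: a ReLU between the linear layer and the LN breaks the
# zero-mean argument, so the same conversion should change the outputs.
lin2, ln2 = nn.Linear(16, 16), nn.LayerNorm(16)
ref2 = ln2(torch.relu(lin2(x)))
rms2 = fold(lin2, ln2)
print(torch.allclose(ref2, rms2(torch.relu(lin2(x))), atol=1e-5))   # expected: False
```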

Figures

Figures reproduced from arXiv: 2605.14521 by Jie Luo, Lei Huang, Wenjun Wu, Yihao Yue, Yizhou Ruan, Yunhao Ni, Yuxin Guo.

Figure 1
Figure 1: Overview of the method. W denotes the weight matrix of a general linear layer, and W∗ is the weight after applying CBWC, which satisfies CCC. view at source ↗
Figure 2
Figure 2: Sketch map of the two training schemes. view at source ↗
Figure 3
Figure 3: An example network and its associated zero-mean graph. Layer D is an LN, layer E is a scalar operation, and ⊕ denotes residual addition. Layer D is foldable if and only if vA, vB ∈ Vl. view at source ↗
Figure 4
Figure 4: Norm of the input to the final layer for MLPs of different depths (d). LN more effectively controls the sample norm before the final layer throughout training. view at source ↗
Figure 6
Figure 6: Training loss curves and validation BLEU scores of Transformer on the Multi30K translation task. Our proposed CBWC+RMSNorm achieves final performance between standard LN and vanilla RMSNorm. view at source ↗
Figure 7
Figure 7: Training and test performance of Transformer-based text classification on the AG News dataset. Results are averaged over 5 random seeds, with shaded regions indicating standard deviation. Our CBWC+RMSNorm matches the convergence behavior and final accuracy of standard LN while achieving nearly the same training and inference throughput as vanilla RMSNorm. view at source ↗
Figure 8
Figure 8: End-to-end latency comparison of Swin-Tiny variants for different processes on ImageNet-100. Our CBWC+RMSNorm delivers markedly lower latency than LN in both forward and backward passes, approaching the efficiency of vanilla RMSNorm. view at source ↗
Figure 9
Figure 9: Sketch map of CCWT (column-centered weight transformation). view at source ↗
Figure 10
Figure 10: The proof and application of our method on Post-LN. 'Att' and 'FFN' refer to the Attention layer and the Feed-Forward Network, both general linear layers. view at source ↗
Figure 11
Figure 11: The proof and application of our method on Pre-LN. 'Att' and 'FFN' refer to the Attention layer and the Feed-Forward Network, both general linear layers; 'Embed' refers to the Embedding layer, which is not. view at source ↗
Figure 12
Figure 12: Mean of the final layer's input for MLPs of different depths (d). The change is similar to the change of norm. view at source ↗
Figure 13
Figure 13: Norm of the input across linear layers for MLPs of different depths (d). LN better controls the norm of samples throughout the model. view at source ↗
Figure 14
Figure 14: Norm of the input across activation layers for MLPs of different depths (d). The change is similar to the change of norm. view at source ↗
Figure 15
Figure 15: Norm of the input across normalization layers for MLPs of different depths (d). view at source ↗
Figure 16
Figure 16: Inference latency comparison across six representative models (GPT-2, BERT, BLOOM, OPT, Phi-3, and ViT). Our CBWC+RMSNorm achieves a consistent end-to-end runtime reduction of 2%–12% with no accuracy degradation. view at source ↗
Figure 17
Figure 17: Inference latency comparison under longer sequence lengths. Our CBWC+RMSNorm achieves larger end-to-end speedups as sequence length increases. view at source ↗
Figure 18
Figure 18: Performance of Transformer models on text classification on the AG News dataset. Results are averaged over 5 random seeds, with shaded regions indicating standard deviation. The experiment is conducted on a 3090Ti. view at source ↗
Figure 19
Figure 19: Train accuracy of the three model variants and the learning-rate setting of the Swin Transformer on ImageNet-100 under a high learning rate (10⁻³); CBWC+RMSNorm performs between the LN and RMSNorm variants. view at source ↗
Figure 20
Figure 20: Performance of the three Swin Transformer variants on ImageNet-100 under a small learning rate (10⁻⁵); the accuracy and loss of the LN and RMSNorm variants are almost the same, while our method shows slightly better generalization, with slightly higher test accuracy and lower test loss. view at source ↗
Figure 21
Figure 21: Comparison of WA and WB. view at source ↗
read the original abstract

Layer normalization (LN) is a fundamental component in modern deep learning, but its per-sample centering and scaling introduce non-negligible inference overhead. RMSNorm improves efficiency by removing the centering operation, yet this may discard benefits associated with centering. This paper propose a framework to determine whether an LN in an arbitrary DNN can be replaced by RMSNorm without changing the model function. The key idea is to fold LN's centering operation into upstream general linear layers by enforcing zero-mean outputs through the column-centered constraint (CCC) and column-based weight centering (CBWC). We extend the analysis to arbitrary DNNs, define such LNs as foldable LNs, and develop a graph-based detection algorithm. Our analysis shows that many LNs in widely used architectures are foldable, enabling exact inference-time conversion and end-to-end acceleration of 2% to 12% without changing model predictions. Experiments across multiple task families further show that, when exact equivalence is partially broken in practical training settings, our method remains competitive with vanilla LN while improving efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a framework to replace Layer Normalization (LN) with RMSNorm in arbitrary DNNs by folding LN's centering into upstream linear layers. This is achieved by enforcing the column-centered constraint (CCC) and column-based weight centering (CBWC) on those layers, defining such LNs as 'foldable,' and providing a graph-based detection algorithm. The paper claims many LNs in widely used architectures are foldable, enabling exact inference-time conversion with 2-12% end-to-end acceleration without changing model predictions; experiments show competitiveness with vanilla LN even when exact equivalence is relaxed in training.

Significance. If the exact equivalence and foldability claims hold, the work would enable practical efficiency gains in inference for transformers and similar models by using faster RMSNorm while preserving centering benefits where possible. The graph-based detection algorithm and analysis of foldable LNs across architectures represent a useful contribution for model optimization. However, the low soundness rating and missing full derivations limit immediate impact until verified.

major comments (2)
  1. [§3] §3 (framework and CCC/CBWC definitions): The central claim of exact model-function preservation when folding LN centering via CCC and CBWC on upstream linear layers lacks a complete, self-contained derivation for arbitrary computation graphs; this is load-bearing for the foldability definition and graph-based detection algorithm.
  2. [Experiments] Experiments section: The reported 2-12% acceleration and competitive results when relaxing exact equivalence are stated without accompanying tables, specific model architectures, or quantitative details on how CBWC is enforced during training, making it impossible to assess whether the weakest assumption (no major retraining adjustments needed) holds.
minor comments (2)
  1. [Abstract] Abstract: 'This paper propose' contains a subject-verb agreement error and should read 'This paper proposes'.
  2. [§2] Notation for CCC and CBWC is introduced without an explicit summary table relating them to standard LN equations, which would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the presentation of the framework and experiments.

read point-by-point responses
  1. Referee: [§3] §3 (framework and CCC/CBWC definitions): The central claim of exact model-function preservation when folding LN centering via CCC and CBWC on upstream linear layers lacks a complete, self-contained derivation for arbitrary computation graphs; this is load-bearing for the foldability definition and graph-based detection algorithm.

    Authors: We agree that a fully self-contained derivation would improve clarity. In the revised manuscript we will insert a new subsection in §3 that derives, from first principles and for arbitrary directed acyclic computation graphs, that applying CCC to the upstream linear-layer outputs together with CBWC on the weights yields exact numerical equivalence between the original LN and the folded RMSNorm. The graph-based detection algorithm will then be shown to follow directly as a reachability query on the resulting constraint graph. revision: yes

  2. Referee: [Experiments] Experiments section: The reported 2-12% acceleration and competitive results when relaxing exact equivalence are stated without accompanying tables, specific model architectures, or quantitative details on how CBWC is enforced during training, making it impossible to assess whether the weakest assumption (no major retraining adjustments needed) holds.

    Authors: We accept that the current experimental description is insufficiently detailed. The revision will add a dedicated experimental subsection containing (i) tables with per-model latency and throughput numbers on BERT-base, GPT-2, and ViT-B/16, (ii) the precise regularization coefficient and schedule used to enforce CBWC, and (iii) ablation results confirming that the same training hyper-parameters as vanilla LN suffice. These additions will make the 2–12% end-to-end gains and the “no major retraining” claim verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via explicit constraints

full rationale

The paper's core chain defines foldable LNs via the column-centered constraint (CCC) and column-based weight centering (CBWC) applied to upstream linear layers, then uses a graph-based algorithm to detect them in arbitrary DNNs. This allows exact folding of LN centering into RMSNorm at inference without altering the model function. No step reduces a prediction to a fitted parameter, renames a known result, or relies on a self-citation chain for the uniqueness or validity of the equivalence; the constraints are stated as enforceable properties that preserve overall function by construction. The reported speedups follow directly from the detection and conversion procedure rather than from any self-referential input.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that centering can be exactly folded via zero-mean constraints on linear layers; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption The centering operation of LN can be exactly folded into upstream linear layers by enforcing zero-mean outputs through CCC and CBWC.
    This premise enables the exact equivalence claim and is invoked to define foldable LNs.

pith-pipeline@v0.9.0 · 5494 in / 1158 out tokens · 100106 ms · 2026-05-15T02:19:31.026410+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · 8 internal anchors
