pith. machine review for the scientific record.

arxiv: 2605.14521 · v1 · submitted 2026-05-14 · 💻 cs.LG

Recognition: 1 theorem link · Lean Theorem

Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 02:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords layer normalization · RMSNorm · inference acceleration · foldable LN · centering operation · deep neural networks · exact equivalence

The pith

Many layer normalizations in standard networks can be folded exactly into upstream layers, allowing precise replacement by faster RMSNorm at inference time with no change in predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a way to test whether any given layer normalization can be replaced by RMSNorm without changing the model's computed function. The test works by checking if the centering step inside LN can be absorbed into the preceding linear layers through a zero-mean constraint on their outputs. When this condition holds, the LN is called foldable. A graph algorithm detects foldable cases across entire networks. In practice this covers many layers in common architectures, so the centering arithmetic can be dropped at inference while the numerical outputs stay identical.
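
To make the single-layer case concrete, here is a minimal numeric sketch of the folding step. It assumes the standard LayerNorm and RMSNorm definitions and ordinary PyTorch tensors; `center_columns` is an illustrative name for the CBWC step, not the paper's released code.

```python
import torch

torch.manual_seed(0)

def layer_norm(z, gamma, beta, eps=1e-6):
    # Standard LN: per-sample centering and scaling over the feature dim.
    mu = z.mean(dim=-1, keepdim=True)
    var = z.var(dim=-1, unbiased=False, keepdim=True)
    return (z - mu) / torch.sqrt(var + eps) * gamma + beta

def rms_norm(z, gamma, beta, eps=1e-6):
    # RMSNorm: scaling only, no centering. The bias term is kept here only to
    # carry LN's beta; vanilla RMSNorm omits it.
    rms = torch.sqrt(z.pow(2).mean(dim=-1, keepdim=True) + eps)
    return z / rms * gamma + beta

def center_columns(W, b):
    # CBWC as described in the abstract: subtract the per-column mean of W and
    # the mean of b, so every output of the linear layer has zero mean (CCC).
    W_star = W - W.mean(dim=0, keepdim=True)   # each column of W_star sums to 0
    b_star = b - b.mean()
    return W_star, b_star

d_in, d_out, batch = 32, 64, 8
W, b = torch.randn(d_out, d_in), torch.randn(d_out)
gamma, beta = torch.randn(d_out), torch.randn(d_out)
x = torch.randn(batch, d_in)

z = x @ W.T + b                          # original pre-normalization activations
W_star, b_star = center_columns(W, b)
z_star = x @ W_star.T + b_star           # zero-mean by construction

out_ln = layer_norm(z, gamma, beta)      # original LN path
out_rms = rms_norm(z_star, gamma, beta)  # folded RMSNorm path
print(torch.allclose(out_ln, out_rms, atol=1e-5))  # True: identical outputs
```

Because LN discards the per-sample mean anyway, shifting that mean into the weights changes nothing downstream while letting the cheaper RMSNorm do the remaining work.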

Core claim

An LN is foldable when its centering operation can be moved into upstream linear layers by enforcing the column-centered constraint on activations and column-based weight centering on the weight matrices, leaving the overall input-output mapping unchanged. A graph-based detection procedure identifies all such foldable LNs in an arbitrary DNN. Once identified, each foldable LN converts exactly to an RMSNorm at inference, removing the per-sample mean subtraction while preserving exact equivalence to the original model.
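
The detection step can be pictured as a pass over the computation graph that propagates a "can be made zero-mean" property. The rules and routine below are an illustrative reconstruction of that idea, not the paper's Algorithm 1; they conservatively treat nonlinearities and embeddings as not zero-mean.

```python
from collections import defaultdict

# Illustrative rules: general linear layers can be made zero-mean via CBWC; LN
# outputs are zero-mean by definition; scalar ops and residual additions keep
# the property; everything else (embeddings, nonlinearities, ...) does not.
ZERO_MEAN_SOURCES = {"linear", "layernorm"}
ZERO_MEAN_PRESERVING = {"scalar", "residual_add"}

def foldable_layernorms(layers, edges):
    """layers: dict name -> kind, in topological order; edges: (src, dst) pairs of a DAG."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    zero_mean = {}
    foldable = []
    for name, kind in layers.items():
        if kind in ZERO_MEAN_SOURCES:
            zero_mean[name] = True
        elif kind in ZERO_MEAN_PRESERVING:
            zero_mean[name] = all(zero_mean.get(p, False) for p in preds[name])
        else:
            zero_mean[name] = False
        # An LN is foldable when every incoming path already carries zero-mean inputs.
        if kind == "layernorm" and preds[name] and all(zero_mean.get(p, False) for p in preds[name]):
            foldable.append(name)
    return foldable

# Toy graph in the spirit of Figure 3: two linear layers feed a residual add, then an LN.
layers = {"A": "linear", "B": "linear", "add": "residual_add", "D": "layernorm"}
edges = [("A", "add"), ("B", "add"), ("add", "D")]
print(foldable_layernorms(layers, edges))  # ['D']
```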

What carries the argument

Foldable LN, identified by whether the centering step can be absorbed via column-centered constraint and column-based weight centering on upstream linear layers.

Load-bearing premise

Enforcing zero-mean outputs on upstream linear layers through column-centered constraints and weight centering leaves the overall model function unchanged.
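
For a single linear layer feeding an LN, that premise can be checked directly. The derivation below is a sketch using the standard LN and RMSNorm formulas, with CCC and CBWC written in an assumed notation that may not match the paper's exact statements; z = Wx + b and μ(z) = (1/n)·1ᵀz is the per-sample mean over the n normalized features.

```latex
\begin{align*}
  W^{*} &= W - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top} W,
  \qquad b^{*} = b - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top} b
  && \text{(CBWC)} \\
  z^{*} &= W^{*}x + b^{*} = z - \mu(z)\,\mathbf{1},
  \qquad \mu(z^{*}) = 0
  && \text{(CCC holds)} \\
  \mathrm{LN}(z)
  &= \frac{z - \mu(z)\,\mathbf{1}}{\sqrt{\tfrac{1}{n}\lVert z - \mu(z)\,\mathbf{1} \rVert^{2} + \epsilon}} \odot \gamma + \beta
   = \frac{z^{*}}{\sqrt{\tfrac{1}{n}\lVert z^{*} \rVert^{2} + \epsilon}} \odot \gamma + \beta
   = \mathrm{RMSNorm}(z^{*})
\end{align*}
```

The input-output mapping is unchanged because LN discards exactly the quantity that CBWC moves into the weights.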

What would settle it

Run the conversion on a network where the detection algorithm reports non-foldable LNs and measure whether the numerical outputs or task accuracy differ from the original LN model.
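
A minimal version of that experiment, sketched here with PyTorch modules: the foldable case (an LN fed directly by a linear layer, as in a Post-LN block) must match exactly after conversion, while a non-foldable case (a nonlinearity between the linear layer and the LN) should not. `fold` and `RMSNorm` are illustrative helpers, not the paper's released code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class RMSNorm(nn.Module):
    """Minimal RMSNorm carrying the affine parameters of the LN it replaces."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight + self.bias

def fold(linear, ln):
    """Apply CBWC to `linear` in place and return an RMSNorm with ln's parameters."""
    with torch.no_grad():
        linear.weight -= linear.weight.mean(dim=0, keepdim=True)
        linear.bias -= linear.bias.mean()
    rms = RMSNorm(ln.normalized_shape[0], ln.eps)
    rms.weight.data.copy_(ln.weight.data)
    rms.bias.data.copy_(ln.bias.data)
    return rms

x = torch.randn(4, 16)

# Foldable case: LN directly after a linear layer -> conversion must be exact.
lin, ln = nn.Linear(16, 16), nn.LayerNorm(16)
ref = ln(lin(x))                     # original model output
rms = fold(lin, ln)
print(torch.allclose(ref, rms(lin(x)), atol=1e-5))                  # expected: True

# Non-foldable case: a ReLU between the linear layer and the LN breaks the
# zero-mean argument, so the same conversion should change the outputs.
lin2, ln2 = nn.Linear(16, 16), nn.LayerNorm(16)
ref2 = ln2(torch.relu(lin2(x)))
rms2 = fold(lin2, ln2)
print(torch.allclose(ref2, rms2(torch.relu(lin2(x))), atol=1e-5))   # expected: False
```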

Figures

Figures reproduced from arXiv: 2605.14521 by Jie Luo, Lei Huang, Wenjun Wu, Yihao Yue, Yizhou Ruan, Yunhao Ni, Yuxin Guo.

Figure 1
Figure 1: Overview of the method. W denotes the weight matrix of a general linear layer, and W∗ is the weight after applying CBWC, which satisfies CCC. view at source ↗
Figure 2
Figure 2: Sketch map of the two training schemes. view at source ↗
Figure 3
Figure 3: An example network and its associated zero-mean graph. Layer D is an LN, layer E is a scalar operation, and ⊕ denotes residual addition. Layer D is foldable if and only if vA, vB ∈ Vl. view at source ↗
Figure 4
Figure 4: Norm of the input to the final layer for MLPs of different depths (d). LN more effectively controls the sample norm before the final layer throughout training. view at source ↗
Figure 6
Figure 6: Training loss curves and validation BLEU scores of Transformer on the Multi30K translation task. Our proposed CBWC+RMSNorm achieves final performance between standard LN and vanilla RMSNorm. view at source ↗
Figure 7
Figure 7: Training and test performance of Transformer-based text classification on the AG News dataset. Results are averaged over 5 random seeds, with shaded regions indicating standard deviation. Our CBWC+RMSNorm matches the convergence behavior and final accuracy of standard LN while achieving nearly the same training and inference throughput as vanilla RMSNorm. view at source ↗
Figure 8
Figure 8: End-to-end latency comparison of Swin-Tiny variants for different processes on ImageNet-100. Our CBWC+RMSNorm delivers markedly lower latency than LN in both forward and backward passes, approaching the efficiency of vanilla RMSNorm. view at source ↗
Figure 9
Figure 9: Sketch map of CCWT (column-centered weight transformation). view at source ↗
Figure 10
Figure 10: The proof and application of our method on Post-LN. 'Att' and 'FFN' refer to the Attention layer and the Feed-Forward Network, both general linear layers. view at source ↗
Figure 11
Figure 11: The proof and application of our method on Pre-LN. 'Att' and 'FFN' refer to the Attention layer and the Feed-Forward Network, both general linear layers; 'Embed' refers to the Embedding layer, which is not. view at source ↗
Figure 12
Figure 12: Mean of the final layer's input for MLPs of different depths (d). The change is similar to the change of norm. view at source ↗
Figure 13
Figure 13: Norm of the input across linear layers for MLPs of different depths (d). LN better controls the norm of samples throughout the model. view at source ↗
Figure 14
Figure 14: Norm of the input across activation layers for MLPs of different depths (d). The change is similar to the change of norm. view at source ↗
Figure 15
Figure 15: Norm of the input across normalization layers for MLPs of different depths (d). view at source ↗
Figure 16
Figure 16: Inference latency comparison across six representative models (GPT-2, BERT, BLOOM, OPT, Phi-3, and ViT). Our CBWC+RMSNorm achieves a consistent end-to-end runtime reduction of 2%–12% with no accuracy degradation. view at source ↗
Figure 17
Figure 17: Inference latency comparison under longer sequence lengths. Our CBWC+RMSNorm achieves larger end-to-end speedups as sequence length increases. view at source ↗
Figure 18
Figure 18: Performance of Transformer models on text classification on the AG News dataset. Results are averaged over 5 random seeds, with shaded regions indicating standard deviation. The experiment is conducted on a 3090Ti. view at source ↗
Figure 19
Figure 19: Train accuracy of the three model variants and the learning-rate setting of the Swin Transformer on ImageNet-100 under a high learning rate (10⁻³); CBWC+RMSNorm performs between the LN and RMSNorm variants. view at source ↗
Figure 20
Figure 20: Performance of the three Swin Transformer variants on ImageNet-100 under a small learning rate (10⁻⁵); the accuracy and loss of the LN and RMSNorm variants are almost the same, while our method shows slightly better generalization, with slightly higher test accuracy and lower test loss. view at source ↗
Figure 21
Figure 21: Comparison of WA and WB. view at source ↗
read the original abstract

Layer normalization (LN) is a fundamental component in modern deep learning, but its per-sample centering and scaling introduce non-negligible inference overhead. RMSNorm improves efficiency by removing the centering operation, yet this may discard benefits associated with centering. This paper propose a framework to determine whether an LN in an arbitrary DNN can be replaced by RMSNorm without changing the model function. The key idea is to fold LN's centering operation into upstream general linear layers by enforcing zero-mean outputs through the column-centered constraint (CCC) and column-based weight centering (CBWC). We extend the analysis to arbitrary DNNs, define such LNs as foldable LNs, and develop a graph-based detection algorithm. Our analysis shows that many LNs in widely used architectures are foldable, enabling exact inference-time conversion and end-to-end acceleration of 2% to 12% without changing model predictions. Experiments across multiple task families further show that, when exact equivalence is partially broken in practical training settings, our method remains competitive with vanilla LN while improving efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a framework to replace Layer Normalization (LN) with RMSNorm in arbitrary DNNs by folding LN's centering into upstream linear layers. This is achieved by enforcing the column-centered constraint (CCC) and column-based weight centering (CBWC) on those layers, defining such LNs as 'foldable,' and providing a graph-based detection algorithm. The paper claims many LNs in widely used architectures are foldable, enabling exact inference-time conversion with 2-12% end-to-end acceleration without changing model predictions; experiments show competitiveness with vanilla LN even when exact equivalence is relaxed in training.

Significance. If the exact equivalence and foldability claims hold, the work would enable practical efficiency gains in inference for transformers and similar models by using faster RMSNorm while preserving centering benefits where possible. The graph-based detection algorithm and analysis of foldable LNs across architectures represent a useful contribution for model optimization. However, the low soundness rating and missing full derivations limit immediate impact until verified.

major comments (2)
  1. [§3] §3 (framework and CCC/CBWC definitions): The central claim of exact model-function preservation when folding LN centering via CCC and CBWC on upstream linear layers lacks a complete, self-contained derivation for arbitrary computation graphs; this is load-bearing for the foldability definition and graph-based detection algorithm.
  2. [Experiments] Experiments section: The reported 2-12% acceleration and competitive results when relaxing exact equivalence are stated without accompanying tables, specific model architectures, or quantitative details on how CBWC is enforced during training, making it impossible to assess whether the weakest assumption (no major retraining adjustments needed) holds.
minor comments (2)
  1. [Abstract] Abstract: 'This paper propose' contains a subject-verb agreement error and should read 'This paper proposes'.
  2. [§2] Notation for CCC and CBWC is introduced without an explicit summary table relating them to standard LN equations, which would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the presentation of the framework and experiments.

read point-by-point responses
  1. Referee: [§3] §3 (framework and CCC/CBWC definitions): The central claim of exact model-function preservation when folding LN centering via CCC and CBWC on upstream linear layers lacks a complete, self-contained derivation for arbitrary computation graphs; this is load-bearing for the foldability definition and graph-based detection algorithm.

    Authors: We agree that a fully self-contained derivation would improve clarity. In the revised manuscript we will insert a new subsection in §3 that derives, from first principles and for arbitrary directed acyclic computation graphs, that applying CCC to the upstream linear-layer outputs together with CBWC on the weights yields exact numerical equivalence between the original LN and the folded RMSNorm. The graph-based detection algorithm will then be shown to follow directly as a reachability query on the resulting constraint graph. revision: yes

  2. Referee: [Experiments] Experiments section: The reported 2-12% acceleration and competitive results when relaxing exact equivalence are stated without accompanying tables, specific model architectures, or quantitative details on how CBWC is enforced during training, making it impossible to assess whether the weakest assumption (no major retraining adjustments needed) holds.

    Authors: We accept that the current experimental description is insufficiently detailed. The revision will add a dedicated experimental subsection containing (i) tables with per-model latency and throughput numbers on BERT-base, GPT-2, and ViT-B/16, (ii) the precise regularization coefficient and schedule used to enforce CBWC, and (iii) ablation results confirming that the same training hyper-parameters as vanilla LN suffice. These additions will make the 2–12% end-to-end gains and the “no major retraining” claim verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via explicit constraints

full rationale

The paper's core chain defines foldable LNs via the column-centered constraint (CCC) and column-based weight centering (CBWC) applied to upstream linear layers, then uses a graph-based algorithm to detect them in arbitrary DNNs. This allows exact folding of LN centering into RMSNorm at inference without altering the model function. No step reduces a prediction to a fitted parameter, renames a known result, or relies on a self-citation chain for the uniqueness or validity of the equivalence; the constraints are stated as enforceable properties that preserve overall function by construction. The reported speedups follow directly from the detection and conversion procedure rather than from any self-referential input.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that centering can be exactly folded via zero-mean constraints on linear layers; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption The centering operation of LN can be exactly folded into upstream linear layers by enforcing zero-mean outputs through CCC and CBWC.
    This premise enables the exact equivalence claim and is invoked to define foldable LNs.

pith-pipeline@v0.9.0 · 5494 in / 1158 out tokens · 100106 ms · 2026-05-15T02:19:31.026410+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · 8 internal anchors
