{"work":{"id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","openalex_id":null,"doi":null,"arxiv_id":"1607.06450","raw_key":null,"title":"Layer Normalization","authors":null,"authors_text":"Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E","year":2016,"venue":"stat.ML","abstract":"Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.","external_url":"https://arxiv.org/abs/1607.06450","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-14T22:33:10.556034+00:00","pith_arxiv_id":"1607.06450","created_at":"2026-05-08T19:24:04.838721+00:00","updated_at":"2026-05-14T22:33:10.556034+00:00","title_quality_ok":false,"display_title":"Layer Normalization","render_title":"Layer Normalization"},"hub":{"state":{"work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":167,"external_cited_by_count":null,"distinct_field_count":36,"first_pith_cited_at":"2017-06-12T17:57:34+00:00","last_pith_cited_at":"2026-05-13T15:04:09+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-14T23:06:18.159725+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":3},{"context_role":"method","n":2}],"polarity_counts":[{"context_polarity":"background","n":3},{"context_polarity":"use_method","n":2}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Layer Normalization","claims":[{"claim_text":"Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. 
This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Layer Normalization because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T21:23:33.643405+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"ba54de2b-4970-4769-b6e6-99b6c8c5a0b8","orcid":null,"display_name":"Jimmy Lei Ba"},{"id":"b40778e0-9514-4d4f-aff0-890c43da4442","orcid":null,"display_name":"Jamie Ryan Kiros"},{"id":"86d5e2d2-d95e-4c35-a74a-cb4c2e005b0f","orcid":null,"display_name":"and Geoffrey E"}]},"error":null,"updated_at":"2026-05-13T21:23:32.658466+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T21:23:33.640276+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":27},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":25},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":21},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":21},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":16},{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":14},{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":12},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":12},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":12},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":11},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":11},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":10},{"title":"Training Deep Nets with Sublinear Memory Cost","work_id":"f2c5c287-a500-40e4-a136-e7e3172db1d7","shared_citers":10},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":9},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":9},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":9},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":9},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":9},{"title":"Generating Long Sequences with Sparse Transformers","work_id":"c5b81688-45ee-4a9a-b095-e6290f45cb6c","shared_citers":8},{"title":"Language Models are Few-Shot 
Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","shared_citers":8},{"title":"Searching for Activation Functions","work_id":"3a43a02d-e005-47ad-8373-c166e20c9ee9","shared_citers":8},{"title":"A simple method for commonsense reasoning","work_id":"a8423cfc-3f91-4307-9c05-02cfd6a0c714","shared_citers":7},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":7},{"title":"Mixed Precision Training","work_id":"c525941b-ce20-4bcb-8509-a9968f1e89c3","shared_citers":7}],"time_series":[{"n":2,"year":2017},{"n":4,"year":2018},{"n":3,"year":2019},{"n":3,"year":2020},{"n":2,"year":2021},{"n":6,"year":2022},{"n":9,"year":2023},{"n":2,"year":2024},{"n":1,"year":2025},{"n":130,"year":2026}]},"error":null,"updated_at":"2026-05-13T21:23:34.314494+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T21:23:36.796286+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Layer Normalization","claims":[{"claim_text":"Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Layer Normalization because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T21:23:32.661145+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Layer Normalization","claims":[{"claim_text":"Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Layer Normalization because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T21:23:32.662721+00:00"}},"summary":{"title":"Layer Normalization","claims":[{"claim_text":"Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. 
A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Layer Normalization because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":27},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":25},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":21},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":21},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":16},{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":14},{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":12},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":12},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":12},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":11},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":11},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":10},{"title":"Training Deep Nets with Sublinear Memory Cost","work_id":"f2c5c287-a500-40e4-a136-e7e3172db1d7","shared_citers":10},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":9},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":9},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":9},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":9},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":9},{"title":"Generating Long Sequences with Sparse Transformers","work_id":"c5b81688-45ee-4a9a-b095-e6290f45cb6c","shared_citers":8},{"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","shared_citers":8},{"title":"Searching for Activation Functions","work_id":"3a43a02d-e005-47ad-8373-c166e20c9ee9","shared_citers":8},{"title":"A simple method for commonsense reasoning","work_id":"a8423cfc-3f91-4307-9c05-02cfd6a0c714","shared_citers":7},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":7},{"title":"Mixed 
Precision Training","work_id":"c525941b-ce20-4bcb-8509-a9968f1e89c3","shared_citers":7}],"time_series":[{"n":2,"year":2017},{"n":4,"year":2018},{"n":3,"year":2019},{"n":3,"year":2020},{"n":2,"year":2021},{"n":6,"year":2022},{"n":9,"year":2023},{"n":2,"year":2024},{"n":1,"year":2025},{"n":130,"year":2026}]},"authors":[{"id":"86d5e2d2-d95e-4c35-a74a-cb4c2e005b0f","orcid":null,"display_name":"Geoffrey E. Hinton","source":"manual","import_confidence":0.72},{"id":"b40778e0-9514-4d4f-aff0-890c43da4442","orcid":null,"display_name":"Jamie Ryan Kiros","source":"manual","import_confidence":0.72},{"id":"ba54de2b-4970-4769-b6e6-99b6c8c5a0b8","orcid":null,"display_name":"Jimmy Lei Ba","source":"manual","import_confidence":0.72}]}}
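
A minimal sketch of the layer normalization computation described in the abstract of this record (arXiv 1607.06450): the mean and variance are computed over all of the summed inputs to a layer for a single training case, and a per-neuron adaptive gain and bias are applied after normalization, before the non-linearity; the same computation is used at training and test time. The names below (layer_norm, gain, bias, eps) are illustrative assumptions, not part of this record.

# Illustrative sketch only; not code from this record.
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """x: (batch, features) summed inputs to one layer.

    Statistics are computed per training case over the layer's neurons,
    unlike batch normalization, which averages over the mini-batch.
    """
    mean = x.mean(axis=-1, keepdims=True)      # per-example mean over the layer
    var = x.var(axis=-1, keepdims=True)        # per-example variance over the layer
    x_hat = (x - mean) / np.sqrt(var + eps)    # normalize each training case independently
    return gain * x_hat + bias                 # per-neuron gain and bias, before the non-linearity

# Usage: normalize a layer's pre-activations, then apply the non-linearity.
h = np.random.randn(4, 8)                      # 4 training cases, 8 neurons
g, b = np.ones(8), np.zeros(8)
out = np.tanh(layer_norm(h, g, b))

Because the statistics depend only on the single training case, the same function applies unchanged to recurrent networks by calling it on the summed inputs at each time step, as the abstract notes.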