
Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton · 2016 · stat.ML · arXiv 1607.06450

Mixed citation behavior. Most common role is background (60%).

166 Pith papers citing it

Background 60% of classified citations


abstract

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
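As a rough illustration of the computation the abstract describes, the following Python sketch normalizes the summed inputs of one layer for a single training case, then applies a per-neuron gain and bias before the non-linearity. The function name `layer_norm`, the parameter names, the example values, the choice of `tanh`, and the small constant `eps` are assumptions made for illustration; the abstract does not specify them.

```python
import math

def layer_norm(summed_inputs, gain, bias, eps=1e-5):
    """Normalize the summed inputs of one layer for a single training case.

    eps is an assumed small constant for numerical stability; it is not
    mentioned in the abstract.
    """
    n = len(summed_inputs)
    # Statistics are taken over the neurons of the layer, not over a mini-batch.
    mean = sum(summed_inputs) / n
    variance = sum((a - mean) ** 2 for a in summed_inputs) / n
    inv_std = 1.0 / math.sqrt(variance + eps)
    # Per-neuron gain and bias are applied after normalization,
    # before any non-linearity.
    return [g * (a - mean) * inv_std + b
            for a, g, b in zip(summed_inputs, gain, bias)]

# Hypothetical summed inputs to a 4-neuron layer for one training case.
a = [1.0, 2.0, 3.0, 4.0]
normalized = layer_norm(a, gain=[1.0] * 4, bias=[0.0] * 4)
hidden = [math.tanh(x) for x in normalized]  # non-linearity applied last
```

Because the mean and variance depend only on the single case passed in, the same function can be called unchanged at training and test time. In a recurrent network it would be called once per time step on that step's summed inputs.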

citation-role summary

background 3 · method 2

citation-polarity summary

background 3 · use method 2

authors

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

representative citing papers

MinMax Recurrent Neural Cascades

cs.LG · 2026-05-07 · conditional · novelty 8.0 · 2 refs

MinMax RNCs are recurrent neural models using min-max recurrence that achieve full regular-language expressivity, logarithmic parallel evaluation, uniformly bounded states, and constant state gradients independent of time distance.

Characterizing the Expressivity of Local Attention in Transformers

cs.CL · 2026-05-01 · unverdicted · novelty 8.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressively complementary.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Reformer: The Efficient Transformer

cs.LG · 2020-01-13 · accept · novelty 8.0

Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.

Trajectory-Agnostic Asteroid Detection in TESS with Deep Learning

astro-ph.EP · 2026-05-12 · unverdicted · novelty 7.0

A W-Net deep learning model detects asteroids in TESS data independently of trajectory by rotating training image cubes and using adaptive normalization for data scaling.

QAP-Router: Tackling Qubit Routing as Dynamic Quadratic Assignment with Reinforcement Learning

quant-ph · 2026-05-12 · unverdicted · novelty 7.0

QAP-Router models qubit routing as dynamic QAP and applies RL with a solution-aware Transformer to cut CNOT counts by 12-30% versus industry compilers on real circuit benchmarks.

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

cs.CV · 2026-05-11 · conditional · novelty 7.0

OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene graph alignment, backed by a new 700k-sample ScanNet-SG dataset.

Meta-Black-Box Optimization Can Do Search Guidance for Expensive Constrained Multi-Objective Optimization

cs.NE · 2026-05-11 · unverdicted · novelty 7.0

MetaSG-SAEA is a bi-level meta-BBO framework that uses a meta-policy for search guidance via the MM-CCI constraint abstraction and diffusion-based population initialization to outperform baselines on expensive constrained multi-objective optimization problems.

Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

cs.CR · 2026-05-09 · unverdicted · novelty 7.0

Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utility trade-off when trying to eliminate them.

Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.

Neural network quantum states in the grand canonical ensemble

quant-ph · 2026-05-08 · unverdicted · novelty 7.0

A new neural quantum state ansatz for bosons in the grand canonical ensemble achieves competitive variational energies in 1D and 2D systems and provides access to one-body reduced density matrices.

QuadNorm: Resolution-Robust Normalization for Neural Operators

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

QuadNorm uses quadrature-based moments instead of uniform averaging in normalization layers, achieving O(h²) consistency across resolutions and better cross-resolution transfer in neural operators.

GPROF-IR: An Improved Single-Channel Infrared Precipitation Retrieval for Merged Satellite Precipitation Products

physics.ao-ph · 2026-05-08 · unverdicted · novelty 7.0

GPROF-IR is a CNN-based retrieval that uses temporal context in geostationary IR observations to produce precipitation estimates with lower error than prior IR methods and climatological consistency with PMW retrievals for integration into IMERG V08.

Solving Max-Cut to Global Optimality via Feasibility-Preserving Graph Neural Networks

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

A Max-Cut-specific graph neural network predicts primal- and dual-feasible SDP solutions in linearithmic time, cutting bounding costs in exact branch-and-bound by up to 10.6 times versus a commercial SDP solver while training without any solved SDP labels.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.

How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.

PHALAR: Phasors for Learned Musical Audio Representations

cs.SD · 2026-05-05 · unverdicted · novelty 7.0 · 3 refs

PHALAR achieves up to 70% relative accuracy gain in stem retrieval with under half the parameters and 7x faster training by using phasor-based equivariant representations, setting new SOTA on multiple datasets.

iGENE: A Differentiable Flux-Tube Gyrokinetic Code in TensorFlow

physics.plasm-ph · 2026-05-04 · unverdicted · novelty 7.0

A fully differentiable TensorFlow gyrokinetic code allows approximate gradients of nonlinear turbulence quantities to be used for outer-loop tasks such as profile prediction despite stochasticity.

Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks

cs.LG · 2026-05-02 · unverdicted · novelty 7.0

EstGraph benchmark evaluates LLMs on estimating properties of very large graphs from random-walk samples that fit in context limits.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

cs.SE · 2026-04-30 · unverdicted · novelty 7.0

DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.96 and Macro-F1 0.85, plus improved developer repair accuracy in a user study.

citing papers explorer

Showing 16 of 16 citing papers after filters.

Characterizing the Expressivity of Local Attention in Transformers
cs.CL · 2026-05-01 · unverdicted · none · ref 1 · internal anchor
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressively complementary.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
cs.CL · 2020-12-31 · conditional · none · ref 167 · internal anchor
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
cs.CL · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

OPT: Open Pre-trained Transformer Language Models
cs.CL · 2022-05-02 · unverdicted · none · ref 82 · internal anchor
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

DeBERTa: Decoding-enhanced BERT with Disentangled Attention
cs.CL · 2020-06-05 · unverdicted · none · ref 1 · internal anchor
DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model superhuman score on SuperGLUE.

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
cs.CL · 2019-09-17 · unverdicted · none · ref 2 · internal anchor
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
cs.CL · 2026-05-11 · unverdicted · none · ref 21 · internal anchor
Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method for attention-based replay selection.

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
cs.CL · 2026-05-01 · unverdicted · none · ref 86 · 2 links · internal anchor
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.

Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
cs.CL · 2026-04-25 · unverdicted · none · ref 14 · internal anchor
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math and code tasks.

Retentive Network: A Successor to Transformer for Large Language Models
cs.CL · 2023-07-17 · unverdicted · none · ref 1 · internal anchor
RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
cs.CL · 2023-03-18 · unverdicted · none · ref 1 · internal anchor
AdaLoRA uses SVD-based pruning to allocate the parameter budget for low-rank fine-tuning updates according to per-matrix importance scores, yielding better performance than uniform allocation especially under tight budgets.

ST-MoE: Designing Stable and Transferable Sparse Expert Models
cs.CL · 2022-02-17 · unverdicted · none · ref 131 · internal anchor
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.

Universal Transformers
cs.CL · 2018-07-10 · unverdicted · none · ref 3 · internal anchor
Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.

Attention Is All You Need
cs.CL · 2017-06-12 · unverdicted · none · ref 1
Pith review generated a malformed one-line summary.

A Survey on Large Language Models for Code Generation
cs.CL · 2024-06-01 · unverdicted · none · ref 18 · internal anchor
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

A Survey of Large Language Models
cs.CL · 2023-03-31 · accept · none · ref 280 · internal anchor
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.