Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton · 2016 · stat.ML · arXiv 1607.06450

Mixed citation behavior. The most common role is background (58% of classified citations).

Cited by 445 Pith papers.

abstract

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
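
The difference the abstract describes between the two normalizers can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the array shapes, and the small `eps` stabilizer are assumptions added here. Layer normalization takes its mean and variance over the summed inputs of one training case, while batch normalization (training mode only in this sketch) takes them per neuron over the mini-batch.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Normalize each training case over its own summed inputs (last axis).

    eps is an assumed numerical stabilizer, not taken from the abstract.
    """
    mu = a.mean(axis=-1, keepdims=True)
    var = a.var(axis=-1, keepdims=True)
    # Per-neuron gain and bias are applied after normalization,
    # before any non-linearity.
    return gain * (a - mu) / np.sqrt(var + eps) + bias

def batch_norm_train(a, gain, bias, eps=1e-5):
    """Normalize each neuron over the mini-batch (first axis), training mode only."""
    mu = a.mean(axis=0, keepdims=True)
    var = a.var(axis=0, keepdims=True)
    return gain * (a - mu) / np.sqrt(var + eps) + bias

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))  # hypothetical: 4 training cases, 8 summed inputs
gain = np.ones(8)
bias = np.zeros(8)

# Layer norm: the output for a case is the same whether or not
# other cases are present in the mini-batch.
ln_full = layer_norm(a, gain, bias)
ln_single = layer_norm(a[:1], gain, bias)
same = np.allclose(ln_full[:1], ln_single)

# Batch norm: the output for a case changes with the mini-batch contents.
bn_full = batch_norm_train(a, gain, bias)
bn_pair = batch_norm_train(a[:2], gain, bias)
differs = not np.allclose(bn_full[:2], bn_pair)
```

Because `layer_norm` uses only the statistics of the case it is given, the same function can be called at test time and, for a recurrent network, once per time step on that step's summed inputs.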

citation-role summary

background 45 · method 23 · baseline 2 · other 2

citation-polarity summary

background 42 · use method 23 · unclear 5 · baseline 2

authors

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

representative citing papers

Tight Sample Complexity of Transformers

cs.LG · 2026-06-08 · unverdicted · novelty 8.0

Depth-L transformers with W parameters have VC dimension Theta(L W log(T W)), yielding matching O(L W log((T+T')W)) upper and Omega(L W log((T+T')W/L)) lower bounds on sample complexity for chain-of-thought learning.

Neural Networks Provably Learn Spectral Representations for Group Composition

cs.LG · 2026-06-02 · unverdicted · novelty 8.0

Two-layer neural networks provably converge almost surely to irreducible representations of finite groups when trained on the group composition task, with the dynamics governed by Riemannian gradient ascent on a representation-theoretic energy functional.

Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

cs.LG · 2026-05-21 · unverdicted · novelty 8.0

Presents a solver-verifiable framework for Transformer circuits, with exhaustive checks on small symbolic tasks and surrogate methods for larger models.

CanViT: Toward Active-Vision Foundation Models

cs.CV · 2026-03-23 · conditional · novelty 8.0

CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

What learning algorithm is in-context learning? Investigations with linear models

cs.LG · 2022-11-28 · accept · novelty 8.0

Transformers performing in-context learning implicitly implement gradient descent, ridge regression, and least-squares predictors for linear models, with behavior shifting based on model depth, width, and data noise.

Masked Autoencoders Are Scalable Vision Learners

cs.CV · 2021-11-11 · accept · novelty 8.0

Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.

Decision Transformer: Reinforcement Learning via Sequence Modeling

cs.LG · 2021-06-02 · accept · novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Reformer: The Efficient Transformer

cs.LG · 2020-01-13 · accept · novelty 8.0

Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.

LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

LeVLJEPA is the first non-contrastive vision-language pretraining method that learns via cross-modal prediction without negatives, producing stronger dense features than contrastive baselines on VQA and segmentation tasks.

Exploring Line Bundle Standard Models with Transformers

hep-th · 2026-06-30 · unverdicted · novelty 7.0

A Transformer RL agent is trained to generate valid heterotic line bundle sums on CICYs that satisfy gauge embedding, anomaly cancellation, poly-stability, chirality, and no-exotics constraints.

Probing Memorization of Tabular In-Context Learning

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

A new probing framework detects moderate parametric memorization signals in tabular in-context learning models under single-task fine-tuning, strongest on low-cardinality tasks, but signals largely disappear under realistic training.

Probing-Guided Layer Selection from Self-Supervised Speech Models for Generalizable Audio Deepfake Detection

cs.SD · 2026-06-29 · unverdicted · novelty 7.0

Probing-guided selection of depth zones from frozen SSL speech models yields compact classifiers with 28% relative EER improvement on cross-domain deepfake detection tasks.

See & Sniff: Learning Visuo-Olfactory Representations

cs.CV · 2026-06-25 · unverdicted · novelty 7.0

Introduces SmellNet-V synthetic visuo-olfactory dataset and See & Sniff self-supervised framework that learns aligned representations and produces smell saliency maps.

FunPiQ: A New Benchmark for Pixel-Level Quality Assessment in Fundus Images

cs.CV · 2026-06-24 · unverdicted · novelty 7.0

FunPiQ supplies the first pixel-level FIQA benchmark and EFIQA-CP uses anatomical visibility pseudo-labels with NNPU learning to deliver superior explainable quality assessment over classification and anomaly detection baselines.

Convergence of Gradient Descent for General Neural Network Architectures Beyond the NTK Regime

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Proves GD convergence to stationary point neighborhoods for general NN architectures beyond NTK via block-level analysis, analyticity, and local smoothness conditions.

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.

Do Transformers Actually Help Intrusion Detection? A Temporal Sequence Evaluation on CIC-IDS2017

cs.CR · 2026-06-09 · accept · novelty 7.0

Padding convention and split protocol affect intrusion detection performance more than architecture on CIC-IDS2017, with Transformers showing 0.24 macro-F1 drop under zero-pad+mask and 67-fold false-alarm rise under leakage-free evaluation.

GNSS-FM: A Self-Supervised Foundation Model for Daily GNSS Displacement Time Series

physics.geo-ph · 2026-06-05 · unverdicted · novelty 7.0

GNSS-FM is a self-supervised foundation model for GNSS displacement time series that outperforms task-specific baselines on 90-day forecasting and seismic step localization after pretraining on global station data.

Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

cs.CV · 2026-06-04 · conditional · novelty 7.0

DTG-FF reaches 91.8% on CIFAR-10 and 49.4% on ImageNet-100 224x224 but BP baselines beat it by 2.4-5.93 pp with gaps widening by class count on real data while reversing the synthetic trend.

Private and Stable Test-Time Adaptation with Differential Privacy

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

Differential privacy versions of TTA methods achieve privacy on ImageNet-C with small accuracy cost and can improve stability via clipping in continual settings.

SurGe: Improved Surface Geometry in Point Maps

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

SurGe improves local surface geometry in feedforward point maps via gradient matching loss and Neighborhood Attention Decoder, topping average rank on eight zero-shot monocular geometry benchmarks for global AbsRel while boosting local metrics.

Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Autoregressive transformers exhibit measurable cognitive fatigue during extended generation, quantified by the Fatigue Index that predicts degradation (AUROC 0.95) and repetition (rho 0.94).

citing papers explorer

Showing 26 of 26 citing papers after filters.

CanViT: Toward Active-Vision Foundation Models
cs.CV · 2026-03-23 · conditional · none · ref 56 · internal anchor
CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling
cs.CL · 2020-12-31 · conditional · none · ref 167 · internal anchor
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training
cs.CV · 2026-06-04 · conditional · none · ref 6 · internal anchor
DTG-FF reaches 91.8% on CIFAR-10 and 49.4% on ImageNet-100 224x224 but BP baselines beat it by 2.4-5.93 pp with gaps widening by class count on real data while reversing the synthetic trend.

OpenSGA: Efficient 3D Scene Graph Alignment in the Open World
cs.CV · 2026-05-11 · conditional · none · ref 11 · internal anchor
OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene graph alignment, backed by a new 700k-sample ScanNet-SG dataset.

To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning
cs.AI · 2026-04-23 · conditional · none · ref 3 · internal anchor
Unembedding collapse in transformers prevents distinguishing unseen tokens in symbolic reasoning, but targeted interventions restore generalization.

One Scale at a Time: Scale-Autoregressive Modeling for Fluid Flow Distributions
cs.CE · 2026-04-13 · conditional · none · ref 2 · internal anchor
Scale-autoregressive modeling (SAR) samples fluid flow distributions hierarchically from coarse to fine resolutions on meshes, achieving lower distributional error and 2-7x faster runtime than diffusion or flow-matching baselines.

Denoising Particle Filters: Learning State Estimation with Single-Step Objectives
cs.RO · 2026-02-23 · conditional · none · ref 34 · internal anchor
Denoising particle filters train state estimators on individual transitions via score matching, then use the learned denoiser with a dynamics model to approximate Bayesian filtering step-by-step, matching end-to-end baselines while preserving composability.

LRM: Large Reconstruction Model for Single Image to 3D
cs.CV · 2023-11-08 · conditional · none · ref 1 · internal anchor
LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.

Efficient Memory Management for Large Language Model Serving with PagedAttention
cs.LG · 2023-09-12 · conditional · none · ref 2 · internal anchor
PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG · 2022-08-15 · conditional · none · ref 64 · internal anchor
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

Deep Modular Co-Attention Networks for Visual Question Answering
cs.CV · 2019-06-25 · conditional · none · ref 3 · internal anchor
MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.

Do Transformers Need Three Projections? Systematic Study of QKV Variants
cs.LG · 2026-06-01 · conditional · none · ref 26 · internal anchor
Q-K=V projection sharing in transformers matches standard QKV performance with 50% KV cache reduction and combines with GQA/MQA for up to 96.9% reduction across vision and language tasks.

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
cs.LG · 2026-05-07 · conditional · none · ref 1 · 2 links · internal anchor
SOPE dynamically controls offline training length in online RL using actor-aligned OPE on validation data to stop when benefits saturate, achieving up to 45.6% better performance and 22x less computation on Minari tasks.

MoDAl: Self-Supervised Neural Modality Discovery via Decorrelation for Speech Neuroprosthesis
q-bio.NC · 2026-04-22 · conditional · none · ref 2 · 2 links · internal anchor
A decorrelation-based training framework recovers complementary linguistic information from Broca's area signals, improving speech neuroprosthesis WER from 26.3% to 21.6% over prior end-to-end methods.

Rethinking Language Model Scaling under Transferable Hypersphere Optimization
cs.LG · 2026-03-30 · conditional · none · ref 2 · internal anchor
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.

What Does Flow Matching Bring To TD Learning?
cs.LG · 2026-03-04 · conditional · none · ref 4 · internal anchor
Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.

Graph-Based Alternatives to LLMs for Human Simulation
cs.CL · 2025-11-03 · conditional · none · ref 4 · internal anchor
GEMS formulates close-ended human-behavior simulation as link prediction on a heterogeneous graph and matches or exceeds LLM performance with three orders of magnitude fewer parameters across three datasets and three evaluation settings.

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
cs.AI · 2024-08-01 · conditional · none · ref 21 · internal anchor
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

The Falcon Series of Open Language Models
cs.CL · 2023-11-28 · conditional · none · ref 75 · internal anchor
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
cs.CL · 2023-05-23 · conditional · none · ref 62 · internal anchor
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.

IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
cs.LG · 2023-04-20 · conditional · none · ref 2 · internal anchor
IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.

Relational inductive biases, deep learning, and graph networks
cs.LG · 2018-06-04 · conditional · none · ref 1 · internal anchor
Graph networks unify graph-based neural methods into a general framework with strong relational inductive biases to support combinatorial generalization and structured reasoning in AI.

Capacity-Controlled Global Attention for Graph Transformers
cs.LG · 2026-04-19 · conditional · none · ref 26 · internal anchor
SigGate-GT adds a per-head sigmoid gate to graph transformer attention outputs, relaxing the softmax convex-combination constraint to reduce over-smoothing and improve stability at ~1% parameter overhead.

Resolution scaling governs DINOv3 transfer performance in chest radiograph classification
cs.CV · 2025-10-08 · conditional · none · ref 48 · internal anchor
DINOv3 at 512x512 resolution with ConvNeXt-B outperforms prior initializations for adult chest X-ray classification but shows no benefit in pediatric cohorts or at 1024 resolution.

Root Mean Square Layer Normalization
cs.LG · 2019-10-16 · conditional · none · ref 3 · internal anchor
RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.

MiniGPT: Rebuilding GPT from First Principles
cs.CL · 2026-05-17 · conditional · none · ref 45 · internal anchor
MiniGPT is a self-contained PyTorch implementation of standard GPT autoregressive modeling that reaches 1.478 validation loss on Tiny Shakespeare with a 10.77M-parameter model and produces recognizable Shakespeare-style text.
