super hub

Layer Normalization

Jamie Ryan Kiros, Jimmy Lei Ba · 2016 · stat.ML · arXiv 1607.06450

165 Pith papers cite this work. Polarity classification is still indexing.

165 Pith papers citing it

open full Pith review browse 165 citing papers more from Jamie Ryan Kiros arXiv PDF

abstract

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

claims ledger

abstract Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not

authors

and Geoffrey E Jamie Ryan Kiros Jimmy Lei Ba

co-cited works

representative citing papers

MinMax Recurrent Neural Cascades

cs.LG · 2026-05-07 · conditional · novelty 8.0 · 2 refs

MinMax RNCs are recurrent neural models using min-max recurrence that achieve full regular-language expressivity, logarithmic parallel evaluation, uniformly bounded states, and constant state gradients independent of time distance.

Characterizing the Expressivity of Local Attention in Transformers

cs.CL · 2026-05-01 · unverdicted · novelty 8.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressively complementary.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Reformer: The Efficient Transformer

cs.LG · 2020-01-13 · accept · novelty 8.0

Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.

Trajectory-Agnostic Asteroid Detection in TESS with Deep Learning

astro-ph.EP · 2026-05-12 · unverdicted · novelty 7.0

A W-Net deep learning model detects asteroids in TESS data independently of trajectory by rotating training image cubes and using adaptive normalization for data scaling.

QAP-Router: Tackling Qubit Routing as Dynamic Quadratic Assignment with Reinforcement Learning

quant-ph · 2026-05-12 · unverdicted · novelty 7.0

QAP-Router models qubit routing as dynamic QAP and applies RL with a solution-aware Transformer to cut CNOT counts by 12-30% versus industry compilers on real circuit benchmarks.

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

cs.CV · 2026-05-11 · conditional · novelty 7.0

OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene graph alignment, backed by a new 700k-sample ScanNet-SG dataset.

Meta-Black-Box Optimization Can Do Search Guidance for Expensive Constrained Multi-Objective Optimization

cs.NE · 2026-05-11 · unverdicted · novelty 7.0

MetaSG-SAEA is a bi-level meta-BBO framework that uses a meta-policy for search guidance via the MM-CCI constraint abstraction and diffusion-based population initialization to outperform baselines on expensive constrained multi-objective optimization problems.

Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off

cs.CR · 2026-05-09 · unverdicted · novelty 7.0

Aligned LLMs exhibit Refusal-Escape Directions (RED) that enable refusal-to-answer transitions via input perturbations; these directions decompose exactly into operator-level sources, creating an inherent safety-utility trade-off when trying to eliminate them.

Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.

Neural network quantum states in the grand canonical ensemble

quant-ph · 2026-05-08 · unverdicted · novelty 7.0

A new neural quantum state ansatz for bosons in the grand canonical ensemble achieves competitive variational energies in 1D and 2D systems and provides access to one-body reduced density matrices.

QuadNorm: Resolution-Robust Normalization for Neural Operators

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

QuadNorm uses quadrature-based moments instead of uniform averaging in normalization layers, achieving O(h²) consistency across resolutions and better cross-resolution transfer in neural operators.

GPROF-IR: An Improved Single-Channel Infrared Precipitation Retrieval for Merged Satellite Precipitation Products

physics.ao-ph · 2026-05-08 · unverdicted · novelty 7.0

GPROF-IR is a CNN-based retrieval that uses temporal context in geostationary IR observations to produce precipitation estimates with lower error than prior IR methods and climatological consistency with PMW retrievals for integration into IMERG V08.

Solving Max-Cut to Global Optimality via Feasibility-Preserving Graph Neural Networks

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

A Max-Cut-specific graph neural network predicts primal- and dual-feasible SDP solutions in linearithmic time, cutting bounding costs in exact branch-and-bound by up to 10.6 times versus a commercial SDP solver while training without any solved SDP labels.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.

How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.

PHALAR: Phasors for Learned Musical Audio Representations

cs.SD · 2026-05-05 · unverdicted · novelty 7.0 · 3 refs

PHALAR achieves up to 70% relative accuracy gain in stem retrieval with under half the parameters and 7x faster training by using phasor-based equivariant representations, setting new SOTA on multiple datasets.

iGENE: A Differentiable Flux-Tube Gyrokinetic Code in TensorFlow

physics.plasm-ph · 2026-05-04 · unverdicted · novelty 7.0

A fully differentiable TensorFlow gyrokinetic code allows approximate gradients of nonlinear turbulence quantities to be used for outer-loop tasks such as profile prediction despite stochasticity.

Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks

cs.LG · 2026-05-02 · unverdicted · novelty 7.0

EstGraph benchmark evaluates LLMs on estimating properties of very large graphs from random-walk samples that fit in context limits.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

cs.SE · 2026-04-30 · unverdicted · novelty 7.0

DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.96 and Macro-F1 0.85, plus improved developer repair accuracy in a user study.

citing papers explorer

Showing 36 of 36 citing papers after filters.

OpenSGA: Efficient 3D Scene Graph Alignment in the Open World cs.CV · 2026-05-11 · conditional · none · ref 11 · internal anchor
OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene graph alignment, backed by a new 700k-sample ScanNet-SG dataset.
Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision cs.CV · 2026-05-08 · unverdicted · none · ref 4 · internal anchor
Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA cs.CV · 2026-04-26 · unverdicted · none · ref 2 · internal anchor
HAC provides a parameter-efficient way to move CLIP into hyperbolic geometry, yielding consistent gains on zero-shot VQA benchmarks without any VQA training data overlap.
Latent Space Probing for Adult Content Detection in Video Generative Models cs.CV · 2026-04-25 · unverdicted · none · ref 50 · internal anchor
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
Linear Image Generation by Synthesizing Exposure Brackets cs.CV · 2026-04-22 · unverdicted · none · ref 1 · internal anchor
The paper introduces a DiT-based flow-matching model that generates linear images by synthesizing text-conditioned exposure brackets to preserve full dynamic range.
Envisioning the Future, One Step at a Time cs.CV · 2026-04-10 · unverdicted · none · ref 4 · internal anchor
An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset cs.CV · 2026-04-08 · accept · none · ref 37 · internal anchor
A new balanced Bangla handwritten character dataset paired with a multi-head attention hybrid model using EfficientNetB3, ViT, and Conformer achieves high accuracy and strong generalization.
Unified Vector Floorplan Generation via Markup Representation cs.CV · 2026-04-06 · unverdicted · none · ref 2 · internal anchor
A single transformer model using a new markup representation generates functional floorplans from diverse conditions and outperforms prior task-specific methods on the RPLAN dataset.
Deformation-based In-Context Learning for Point Cloud Understanding cs.CV · 2026-04-03 · unverdicted · none · ref 1 · internal anchor
DeformPIC deforms query point clouds under prompt guidance for in-context learning, outperforming prior methods with lower Chamfer Distance on reconstruction, denoising, and registration tasks.
Segment Anything cs.CV · 2023-04-05 · unverdicted · none · ref 4 · internal anchor
A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
DreamFusion: Text-to-3D using 2D Diffusion cs.CV · 2022-09-29 · accept · none · ref 90 · internal anchor
Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding cs.CV · 2022-05-23 · accept · none · ref 2 · internal anchor
Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
Stereo Magnification: Learning View Synthesis using Multiplane Images cs.CV · 2018-05-24 · unverdicted · none · ref 3 · internal anchor
A deep network predicts multiplane images from narrow-baseline stereo pairs to synthesize novel views that extrapolate beyond the input baseline.
Metonymy in vision models undermines attention-based interpretability cs.CV · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
Pretrained vision transformers exhibit strong intra-object leakage where each part representation encodes information from the entire object, undermining the faithfulness of attention-based part-centric interpretability methods.
GTF: Omnidirectional EPI Transformer for Light Field Super-Resolution cs.CV · 2026-05-06 · unverdicted · none · ref 1 · internal anchor
GTF is an omnidirectional EPI Transformer for light field super-resolution that models horizontal, vertical, 45-degree and 135-degree epipolar geometries, reaching 32.78 dB on benchmarks and top ranks in the NTIRE 2026 challenge.
Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection cs.CV · 2026-04-23 · unverdicted · none · ref 1 · internal anchor
Ramen enables robust test-time adaptation of vision-language models under mixed-domain shifts by actively selecting domain-consistent and prediction-balanced samples via an embedding-gradient cache.
Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding cs.CV · 2026-04-21 · unverdicted · none · ref 62 · internal anchor
A minimally modified vanilla Transformer called Volt achieves state-of-the-art 3D semantic and instance segmentation by using volumetric tokens, 3D rotary embeddings, and a data-efficient training recipe that scales better than domain-specific backbones.
EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents cs.CV · 2026-04-19 · unverdicted · none · ref 2 · internal anchor
EmbodiedHead introduces a Rectified-Flow Diffusion Transformer with differentiable renderer and single-stream listening-speaking conditioning to achieve real-time high-fidelity conversational avatars.
STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding cs.CV · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
STS-Mixer decomposes 4D point cloud videos into multi-band spectral signals via graph transforms and mixes them with spatiotemporal representations to achieve better results on 3D action recognition and 4D semantic segmentation benchmarks.
Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions cs.CV · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
The model uses dense visuo-tactile feature interactions and material-diversity pairing on expanded datasets to generate tactile saliency maps for material segmentation, outperforming prior global-alignment methods.
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models cs.CV · 2026-04-08 · unverdicted · none · ref 63 · internal anchor
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
Visual prompting reimagined: The power of the Activation Prompts cs.CV · 2026-04-07 · unverdicted · none · ref 57 · internal anchor
Activation prompts on intermediate layers outperform input-level visual prompting and parameter-efficient fine-tuning in accuracy and efficiency across 29 datasets.
R\'enyi Attention Entropy for Patch Pruning cs.CV · 2026-04-04 · unverdicted · none · ref 3 · internal anchor
Rényi entropy of attention maps serves as a tunable criterion for pruning redundant patches in vision transformers, reducing compute with preserved accuracy on image recognition.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 11 · internal anchor
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models cs.CV · 2023-08-13 · unverdicted · none · ref 41 · internal anchor
IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
VideoGPT: Video Generation using VQ-VAE and Transformers cs.CV · 2021-04-20 · accept · none · ref 2 · internal anchor
VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
Linearizing Vision Transformer with Test-Time Training cs.CV · 2026-05-04 · unverdicted · none · ref 1
Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while speeding inference 1.32-1.47x.
USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation cs.CV · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
USEMA is a hybrid UNet architecture merging CNNs with scalable Mamba-like attention (SEMA) that achieves better efficiency than transformers and superior segmentation accuracy than pure CNN or Mamba models across medical imaging modalities.
Spatial-Frequency Gated Swin Transformer for Remote Sensing Single-Image Super-Resolution cs.CV · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
SFG-SwinSR improves PSNR to 45.19 dB and SSIM to 0.9852 on SpaceNet by adding a depthwise-blur plus gated spatial branch inside each Swin2SR feed-forward network.
Static and Dynamic Graph Alignment Network for Temporal Video Grounding cs.CV · 2026-05-01 · unverdicted · none · ref 47 · internal anchor
SDGAN improves temporal video grounding by jointly using static and dynamic graphs with query-clip contrastive learning and easy-to-hard progressive training.
Time-series Meets Complex Motion Modeling: Robust and Computational-effective Motion Predictor for Multi-object Tracking cs.CV · 2026-05-01 · unverdicted · none · ref 36 · internal anchor
TCMP achieves SOTA MOT metrics (HOTA 63.4%, IDF1 65.0%, AssA 49.1%) with 0.014x parameters and 0.05x FLOPs of the previous best method by using a simple dilated TCN regressor.
SAT: Selective Aggregation Transformer for Image Super-Resolution cs.CV · 2026-04-09 · unverdicted · none · ref 5 · internal anchor
SAT introduces density and isolation-based token aggregation to enable efficient global attention in super-resolution transformers, claiming up to 0.22 dB PSNR gain and 27% FLOP reduction over PFT.
From Spherical to Gaussian: A Comparative Analysis of Point Cloud Cropping Strategies in Large-Scale 3D Environments cs.CV · 2026-05-03 · unverdicted · none · ref 70
Gaussian and linear cropping strategies for large point clouds improve 3D neural network performance over spherical crops, especially in outdoor scenes, and achieve new state-of-the-art results.
Lightweight Cross-Spectral Face Recognition via Contrastive Alignment and Distillation cs.CV · 2026-05-06 · unverdicted · none · ref 53 · internal anchor
A lightweight hybrid CNN-Transformer framework for heterogeneous face recognition achieves competitive performance on cross-spectral benchmarks and standard RGB tasks using contrastive alignment and distillation.
Wan-Image: Pushing the Boundaries of Generative Visual Intelligence cs.CV · 2026-04-21 · unverdicted · none · ref 2 · internal anchor
Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise editing, outperforming several prior models in human tests.
Cosmos World Foundation Model Platform for Physical AI cs.CV · 2025-01-07 · unverdicted · none · ref 107 · internal anchor
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Layer Normalization

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer