Two-layer neural networks provably converge almost surely to irreducible representations of finite groups when trained on the group composition task, with the dynamics governed by Riemannian gradient ascent on a representation-theoretic energy functional.
Layer Normalization
Mixed citation behavior. Most common role is background (58%).
abstract
Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
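The computation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `eps` stability constant, and the choice of ReLU as the non-linearity are assumptions made here for the example. It shows the statistics being taken over all summed inputs of one layer for a single training case, followed by a per-neuron gain and bias applied before the non-linearity.

```python
import math


def layer_norm(summed_inputs, gain, bias, eps=1e-5):
    """Normalize one layer's summed inputs for a single training case.

    The mean and variance are computed across the neurons of the layer,
    not across a mini-batch. `eps` is an assumed numerical-stability
    constant; the abstract does not mention it.
    """
    n = len(summed_inputs)
    mean = sum(summed_inputs) / n
    var = sum((a - mean) ** 2 for a in summed_inputs) / n
    std = math.sqrt(var + eps)
    # Each neuron has its own adaptive gain and bias, applied after
    # normalization and before the non-linearity.
    return [g * (a - mean) / std + b
            for a, g, b in zip(summed_inputs, gain, bias)]


def relu(values):
    """Assumed non-linearity for the example; any activation could be used."""
    return [max(0.0, v) for v in values]


# One training case, one layer with four neurons (illustrative values).
summed_inputs = [1.0, 2.0, 3.0, 4.0]
gain = [1.0] * 4
bias = [0.0] * 4

normalized = layer_norm(summed_inputs, gain, bias)
hidden = relu(normalized)
```

Because the statistics depend only on the current training case, the same function can be called unchanged at training and test time, and in a recurrent network it would be called once per time step on that step's summed inputs.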
representative citing papers
Presents a solver-verifiable framework for Transformer circuits, with exhaustive checks on small symbolic tasks and surrogate methods for larger models.
CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Transformers performing in-context learning implicitly implement gradient descent, ridge regression, and least-squares predictors for linear models, with behavior shifting based on model depth, width, and data noise.
Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.
A new probing framework detects moderate parametric memorization signals in tabular in-context learning models under single-task fine-tuning, strongest on low-cardinality tasks, but signals largely disappear under realistic training.
Probing-guided selection of depth zones from frozen SSL speech models yields compact classifiers with 28% relative EER improvement on cross-domain deepfake detection tasks.
Differential privacy versions of TTA methods achieve privacy on ImageNet-C with small accuracy cost and can improve stability via clipping in continual settings.
SurGe improves local surface geometry in feedforward point maps via gradient matching loss and Neighborhood Attention Decoder, topping average rank on eight zero-shot monocular geometry benchmarks for global AbsRel while boosting local metrics.
Autoregressive transformers exhibit measurable cognitive fatigue during extended generation, quantified by the Fatigue Index that predicts degradation (AUROC 0.95) and repetition (rho 0.94).
A Set-Transformer architecture with self-attention encodes Pauli-string correlations, optimizes via commutation objective, and finds symmetries with near-deterministic success on physical models like Ising and Toric code.
Recurrent trace units enable exact RTRL with linear time/memory for streaming RL under partial observability, sustaining performance on long-chain memory tasks where TBPTT baselines collapse.
CRiSP uses neural-guided MCTS and curriculum learning to insert Clifford prefixes before parameterized rotations in VQAs, yielding mean 3.17x and max 45x gains in energy accuracy on 22-qubit QAOA benchmarks versus prior Clifford initializers.
Thermo-VL augments a frozen Molmo-7B VLM with a trainable thermal encoder and prompt-conditioned dual-attention fusion to improve cross-spectrum visual reasoning.
Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.
Riemannian networks are introduced for the full-rank correlation matrix manifold by extending MLR, FC, and convolutional layers to five geometries with backpropagation methods for two, showing effectiveness over SPD and Grassmannian baselines.
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
Nested-GPT is an autoregressive Transformer surrogate that generates variable-multiplicity parton showers while enforcing ordered Markovian branching and matches reference Monte Carlo results for leading-log non-global logarithm resummation in the large-Nc limit.
Domain transfer becomes identifiable from marginals plus one anchor under Jacobian sparsity, enabled by a randomized masked finite-difference regularizer.
The paper proves negative weight drift at initialization under MSE or cross-entropy with asymmetric activations, links it to up to 90% sparsity in GPT-nano, maps the sparsity-accuracy cliff across 79 configurations, and shows clipped ReLU² and GELU² improve validation loss.
citing papers explorer
-
CanViT: Toward Active-Vision Foundation Models
CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.
-
Masked Autoencoders Are Scalable Vision Learners
Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.
-
SurGe: Improved Surface Geometry in Point Maps
SurGe improves local surface geometry in feedforward point maps via gradient matching loss and Neighborhood Attention Decoder, topping average rank on eight zero-shot monocular geometry benchmarks for global AbsRel while boosting local metrics.
-
Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception
Thermo-VL augments a frozen Molmo-7B VLM with a trainable thermal encoder and prompt-conditioned dual-attention fusion to improve cross-spectrum visual reasoning.
-
ChangeFlow -- Latent Rectified Flow for Change Detection in Remote Sensing
ChangeFlow reformulates remote sensing change detection as latent rectified-flow mask synthesis, reaching 80.4% average F1 across four benchmarks with 1.3-point gain and sampling-based ensembling.
-
OpenSGA: Efficient 3D Scene Graph Alignment in the Open World
OpenSGA fuses vision-language, textual, and geometric features via a distance-gated attention encoder and minimum-cost-flow allocator to outperform prior methods on both frame-to-scan and subscan-to-subscan 3D scene graph alignment, backed by a new 700k-sample ScanNet-SG dataset.
-
Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision
Delta-Adapter extracts a semantic delta from a single image pair via a pre-trained vision encoder and injects it through a Perceiver adapter to enable scalable single-pair supervised editing.
-
HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA
HAC provides a parameter-efficient way to move CLIP into hyperbolic geometry, yielding consistent gains on zero-shot VQA benchmarks without any VQA training data overlap.
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
Linear Image Generation by Synthesizing Exposure Brackets
The paper introduces a DiT-based flow-matching model that generates linear images by synthesizing text-conditioned exposure brackets to preserve full dynamic range.
-
Envisioning the Future, One Step at a Time
An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
-
Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset
A new balanced Bangla handwritten character dataset paired with a multi-head attention hybrid model using EfficientNetB3, ViT, and Conformer achieves high accuracy and strong generalization.
-
Unified Vector Floorplan Generation via Markup Representation
A single transformer model using a new markup representation generates functional floorplans from diverse conditions and outperforms prior task-specific methods on the RPLAN dataset.
-
Deformation-based In-Context Learning for Point Cloud Understanding
DeformPIC deforms query point clouds under prompt guidance for in-context learning, outperforming prior methods with lower Chamfer Distance on reconstruction, denoising, and registration tasks.
-
ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction
ReWeaver reconstructs topology-accurate 3D garments and sewing patterns from sparse multi-view images by predicting seams and panels in 2D UV and 3D space using a new 100k-sample synthetic dataset.
-
REVNET: Rotation-Equivariant Point Cloud Completion via Vector Neuron Anchor Transformer
REVNET is a rotation-equivariant point cloud completion model using Vector Neuron anchors and transformers that outperforms prior methods on synthetic MVP data and matches non-equivariant baselines on real KITTI data without input alignment.
-
Recurrent Video Masked Autoencoders
RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.
-
Distilling Specialized Orders for Visual Generation
OAR distills specialized generation orders from any-order AR models via self-distillation, improving FID from 2.39 to 2.17 on ImageNet 256x256 while preserving multi-task flexibility.
-
LRM: Large Reconstruction Model for Single Image to 3D
LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.
-
Segment Anything
A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
-
DreamFusion: Text-to-3D using 2D Diffusion
Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.
-
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
-
Switchable Normalization for Learning-to-Normalize Deep Representation
Switchable Normalization learns per-layer weights to combine channel, layer, and minibatch normalizers, claiming robustness to batch size and better results than fixed normalizers on ImageNet, COCO, CityScapes, ADE20K, MegaFace, and Kinetics.
-
Localizing Unseen Activities in Video via Image Query
Introduces Image-Based Activity Localization task for unseen activities, a self-attention interaction localizer using region self-attention and local transformer, and the ActivityIBAL dataset from ActivityNet.
-
Deep Modular Co-Attention Networks for Visual Question Answering
MCAN stacks modular co-attention layers to reach 70.63% accuracy on VQA-v2 test-dev, outperforming prior state-of-the-art models.
-
Stereo Magnification: Learning View Synthesis using Multiplane Images
A deep network predicts multiplane images from narrow-baseline stereo pairs to synthesize novel views that extrapolate beyond the input baseline.
-
PointSplat: Compact Gaussian Splatting via Human-Centric Prediction
PointSplat infers compact Gaussian splats directly in 3D space from input point sets via ray casting and Point-Image Transformer to reduce inter-view redundancy and improve novel-view quality for humans.
-
Unveiling Transferability in Trajectory Prediction via Latent Scene Embeddings
Framework learns latent scene embeddings from 24 trajectory datasets to produce transferability scores that correlate with cross-dataset model performance.
-
Learning from Reliable Latent Prompts for Visual Recognition with Missing Modalities
Proposes input-agnostic latent prompts for robust cross-modal compensation in missing-modality visual recognition, claiming SOTA on three benchmarks.
-
There and Back Again: A Flexible-Frame Transformer for Multi-Exposure Fusion
FreeMEF is the first flexible-frame transformer for multi-exposure fusion using a recurrent state space module and global feature guided block to handle variable numbers of input exposures.
-
Meta-learning as a principle for human-like visual representations
Meta-learning across diverse image-to-concept tasks yields visual representations that align better with human behavior and high-level visual cortex than standard pretraining.
-
Sensitivity as a Double-Edged Sword: A Trade-off Between Discriminability and Adversarial Robustness
Identifies sensitivity as the source of both discriminability and vulnerability in FC classifiers versus robustness in l2 classifiers, and introduces HPM prototype fusion plus MSA evaluation to improve adversarial robustness.
-
Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders
C-GSPN scales 2D spatial propagation to foundation vision encoders via a fast CUDA kernel, compressed blocks, and two-stage distillation, matching ViT performance with 15% fewer parameters and 4x block speedup at 2K resolution.
-
LFA: Layer Feature Attention for Run-Time Introspection of 2D Object Detectors in Automated Driving
LFA aggregates multi-layer backbone features via attention to improve run-time prediction of 2D object detector failures, outperforming single-layer baselines on KITTI and BDD100K.
-
DinoComplete: 3D Shape Completion with Distilled Semantic Priors and State Space Models
DinoComplete augments geometric 3D shape completion with voxel-aligned DINO semantic priors and multi-scale voxel Mamba modeling to improve results on unseen categories with lower compute.
-
E³C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control
E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.
-
Weierstrass Positional Encoding for Vision Transformers
WePE encodes 2D patch positions in Vision Transformers via Weierstrass elliptic functions on the complex plane to exploit double periodicity and derive relative positions algebraically.
-
PlantPose: Universal Plant Skeleton Estimation via Tree-constrained Graph Generation
PlantPose combines learned graph generation with classical tree-enforcing algorithms and a large mixed real/synthetic dataset to estimate arbitrary plant skeletons from varied image styles including out-of-domain cases.
-
Metonymy in vision models undermines attention-based interpretability
Pretrained vision transformers exhibit strong intra-object leakage where each part representation encodes information from the entire object, undermining the faithfulness of attention-based part-centric interpretability methods.
-
GTF: Omnidirectional EPI Transformer for Light Field Super-Resolution
GTF is an omnidirectional EPI Transformer for light field super-resolution that models horizontal, vertical, 45-degree and 135-degree epipolar geometries, reaching 32.78 dB on benchmarks and top ranks in the NTIRE 2026 challenge.
-
Linearizing Vision Transformer with Test-Time Training
Converts pretrained Vision Transformers to linear-complexity TTT models via architectural and representational alignment, demonstrated by linearizing Stable Diffusion 3.5 with 1-hour fine-tuning to match quality at 1.32-1.47x faster inference.
-
Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
Ramen enables robust test-time adaptation of vision-language models under mixed-domain shifts by actively selecting domain-consistent and prediction-balanced samples via an embedding-gradient cache.
-
Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding
A minimally modified vanilla Transformer called Volt achieves state-of-the-art 3D semantic and instance segmentation by using volumetric tokens, 3D rotary embeddings, and a data-efficient training recipe that scales better than domain-specific backbones.
-
EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents
EmbodiedHead introduces a Rectified-Flow Diffusion Transformer with differentiable renderer and single-stream listening-speaking conditioning to achieve real-time high-fidelity conversational avatars.
-
STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding
STS-Mixer decomposes 4D point cloud videos into multi-band spectral signals via graph transforms and mixes them with spatiotemporal representations to achieve better results on 3D action recognition and 4D semantic segmentation benchmarks.
-
Seeing Through Touch: Tactile-Driven Visual Localization of Material Regions
The model uses dense visuo-tactile feature interactions and material-diversity pairing on expanded datasets to generate tactile saliency maps for material segmentation, outperforming prior global-alignment methods.
-
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
-
Visual prompting reimagined: The power of the Activation Prompts
Activation prompts on intermediate layers outperform input-level visual prompting and parameter-efficient fine-tuning in accuracy and efficiency across 29 datasets.
-
Rényi Attention Entropy for Patch Pruning
Rényi entropy of attention maps serves as a tunable criterion for pruning redundant patches in vision transformers, reducing compute with preserved accuracy on image recognition.
-
VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification
VHOI densifies sparse trajectories into color-encoded HOI mask sequences and conditions a fine-tuned video diffusion model on them to produce controllable human-object interaction videos, including full navigation sequences.