International Conference on Machine Learning (ICML)

On Layer Normalization in the Transformer Architecture · 2020 · arXiv 2002.04745

15 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

support: 1 · use method: 1

representative citing papers

Stability and Generalization in Looped Transformers

cs.LG · 2026-04-16 · unverdicted · novelty 8.0

Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks

cs.NI · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

A graph transformer with RL stabilizations is the first to exceed benchmarks for dynamic RMSA, supporting up to 13% more traffic load on networks up to 143 nodes.

Longformer: The Long-Document Transformer

cs.CL · 2020-04-10 · accept · novelty 7.0

Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.

Modeling Local, Global, and Cross-Modal Context in Multimodal 3D MRI

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

MICViT outperforms CNN and transformer baselines on brain age prediction from multimodal 3D MRI by combining modality-specific and cross-modal local/global attention across three heterogeneous datasets.

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.

Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

cs.AI · 2026-06-24 · unverdicted · novelty 6.0 · 2 refs

Quantized reasoning models produce longer chains of thought, inflating token usage and negating per-token speedups from low-bit quantization across multiple benchmarks.

When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals

cs.LG · 2026-06-11 · unverdicted · novelty 6.0

Block Attention Residuals make routing observable as a tensor, but causal probes on trained versus baseline 0.6B models show routing mass often fails to predict causal contribution and structured motifs require training.

A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU + RMSNorm Block under Ternary Quantization

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Sign-flip perturbations produce π/(π-2) ≈ 2.75 times more transverse output energy than equal-norm sign-preserving perturbations in a ReLU + RMSNorm block because ReLU creates directional asymmetry that RMSNorm's transverse projection exposes.

Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers

cs.LG · 2026-02-11 · unverdicted · novelty 6.0

TaperNorm gradually removes internal normalization in pre-norm transformers via learned gates that reach zero, revealing final norm as a scale anchor and enabling up to 1.18x faster KV-cached decoding with small loss increases.

Review Residuals: Update-Conditioned Residual Gating for Transformers

cs.LG · 2026-06-30 · unverdicted · novelty 5.0

Review Residuals add an update-conditioned gate to transformer residual connections, yielding depth-stable training and performance gains that emerge and grow with model size from 590M parameters upward.

Predicting the thermodynamics in the chromosphere from the translation of SDO data into the IRIS$^{2}$ inversion results using a visual transformer model

astro-ph.SR · 2026-04-23 · unverdicted · novelty 5.0

A visual transformer model trained on IRIS inversions predicts chromospheric temperature and density from SDO data with correlations around 0.8 on 80% of test cases.

Attention Residuals

cs.CL · 2026-03-16 · unverdicted · novelty 5.0

Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.

Multi-Gate Residuals

cs.LG · 2026-05-22 · unverdicted · novelty 3.0

Multi-Gate Residuals stabilizes activation scales in deep residual networks via multi-stream gating and attention pooling without added communication overhead.

LLMOrbit: A Circular Taxonomy of Large Language Models - From Scaling Walls to Agentic AI Systems

cs.LG · 2026-01-20 · unverdicted · novelty 3.0

A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
