
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang · 2025 · cs.CL · arXiv 2505.06708

Mixed citation behavior. Most common role is method (43%).

63 Pith papers citing it

Method 43% of classified citations


abstract

Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates 'attention sink' and enhances long-context extrapolation performance, and we also release the related code (https://github.com/qiuzh20/gated_attention) and models (https://huggingface.co/QwQZh/gated_attention) to facilitate future research.
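The modification described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' released implementation. It assumes the gate is computed from the same hidden state that produces the query (one reading of "query-dependent"), that each head has its own gate weights ("head-specific"), and that the gate is applied elementwise to the SDPA output before the output projection. The function name `gated_attention` and the parameter names `w_q`, `w_k`, `w_v`, `w_g`, `w_o` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_attention(x, w_q, w_k, w_v, w_g, w_o, causal=True):
    """Multi-head softmax attention with a sigmoid gate on the SDPA output.

    x:                  (seq, d_model) token hidden states
    w_q, w_k, w_v, w_g: (heads, d_model, d_head) per-head projections
    w_o:                (heads * d_head, d_model) output projection
    Returns the layer output (seq, d_model) and the gate (heads, seq, d_head).
    """
    seq, _ = x.shape
    heads, _, d_head = w_q.shape

    # Per-head query, key, and value projections.
    q = np.einsum("sd,hde->hse", x, w_q)
    k = np.einsum("sd,hde->hse", x, w_k)
    v = np.einsum("sd,hde->hse", x, w_v)

    # Standard scaled dot-product attention.
    scores = np.einsum("hse,hte->hst", q, k) / np.sqrt(d_head)
    if causal:
        # Mask out positions to the right of each query token.
        future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    attn = softmax(scores, axis=-1)
    sdpa_out = np.einsum("hst,hte->hse", attn, v)

    # Assumption: the gate depends only on the current token's hidden state
    # and uses separate weights for every head.
    gate = sigmoid(np.einsum("sd,hde->hse", x, w_g))

    # Elementwise modulation of the SDPA output, then merge heads and project.
    gated = gate * sdpa_out
    merged = gated.transpose(1, 0, 2).reshape(seq, heads * d_head)
    return merged @ w_o, gate
```

Because the sigmoid sits between the value projection and the output projection, the product of those two linear maps no longer collapses into a single low-rank matrix, which corresponds to factor (1) in the abstract. Gate values near zero suppress individual SDPA output channels per token, which corresponds to factor (2).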


citation-role summary

background: 7 · method: 6 · dataset: 1

citation-polarity summary

use method: 6 · background: 5 · support: 1 · unclear: 1 · use dataset: 1

representative citing papers

GIANTS: Generative Insight Anticipation from Scientific Literature

cs.CL · 2026-04-10 · unverdicted · novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

Scaling Storm-Resolving Atmospheric AI Simulation to the Entire Planet

physics.ao-ph · 2026-06-30 · unverdicted · novelty 7.0

STRATA is the first autoregressive transformer emulator for global 4.9-km storm-resolving atmospheric dynamics, achieving 50x better energy efficiency than the underlying physics model while producing realistic km-scale features in 24-hour forecasts.

Memory Retrieval in Visuomotor Policies for Long-Horizon Robot Control

cs.RO · 2026-06-23 · unverdicted · novelty 7.0

HALO distills VLM priors via question-answering objectives and applies sparse attention to enable reliable memory retrieval from up to eight minutes of history in imitation-learned visuomotor policies.

Tapered Language Models

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.

NuGNN: a Graph Neural Network for Nuclear Reaction Network Equations

nucl-th · 2026-06-03 · unverdicted · novelty 7.0

NuGNN applies a heterogeneous graph neural network to surrogate-solve a 690-isotope nuclear reaction network, achieving few-percent errors and reproducing final abundances where fully connected and Res-U-Net models fail.

Dynamics of Stochastic Momentum with Sparse Updates in High Dimensions

stat.ML · 2026-05-27 · unverdicted · novelty 7.0

Characterizes high-dimensional phase structure of momentum under sparse updates via closed-form second-moment dynamics, with regimes matching SGD, unstable, or heavy-ball depending on retention-to-learning timescale ratio.

Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

Meta-Attention introduces per-token Bayesian routing among attention mechanisms via amortised variational inference with a Dirichlet prior, yielding lower projected FLOP cost than prior-free routing on a Tiny LM benchmark.

A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

cs.CL · 2026-05-08 · conditional · novelty 7.0 · 2 refs

Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.

Degradation-Aware Adaptive Context Gating for Unified Image Restoration

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

DACG-IR adds a lightweight degradation-aware module that generates prompts to adaptively gate attention temperature, output features, and spatial-channel fusion in an encoder-decoder network for unified image restoration.

TokenFormer: Unify the Multi-Field and Sequential Recommendation Worlds

cs.IR · 2026-04-15 · unverdicted · novelty 7.0

TokenFormer unifies multi-field and sequential recommendation modeling via bottom-full-top-sliding attention and non-linear interaction representations to avoid sequential collapse and deliver state-of-the-art performance.

Gradient Boosting within a Single Attention Layer

cs.LG · 2026-04-03 · conditional · novelty 7.0

Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over standard attention.

ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.

Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation

cs.CV · 2026-03-03 · unverdicted · novelty 7.0

GDLA delivers state-of-the-art accuracy on CT, MRI, ultrasound and dermoscopy segmentation benchmarks while keeping linear O(N) complexity in a PVT encoder-decoder.

Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns

cs.LG · 2026-02-25 · unverdicted · novelty 7.0

TRC² is a brain-inspired decoder-only architecture that localizes fast plasticity and uses thalamic and hippocampal pathways to substantially reduce cumulative forgetting in sequential language model training on streams like C4, WikiText-103, and GSM8K.

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

cs.CV · 2026-02-11 · unverdicted · novelty 7.0

DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.

MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation

cs.CV · 2026-06-24 · unverdicted · novelty 6.0

MIMFlow uses a VAE on masked images to feed semantic latents to a normalizing flow while a decoder handles high-frequency details, reporting FID 2.50 and 71.3% linear probing on ImageNet 256x256 with 128 tokens.

QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging

cs.CV · 2026-06-18 · unverdicted · novelty 6.0

QG-MIL introduces four gated transformer components that yield +6.1 average macro F1 improvement over baselines on six whole-slide and cell-level medical imaging benchmarks while producing more uniform attention.

Enhancing Multilingual Reasoning via Steerable Model Merging

cs.CL · 2026-06-17 · unverdicted · novelty 6.0

ST-Merge uses gated cross-attention to adaptively weight source models during merging, outperforming baselines on multilingual reasoning tasks across 21 languages.

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

Sycophancy fine-tuning induces emergent misalignment in LLMs that Alignment Gating can reverse by learning to suppress unsafe representations with generalization from narrow to broad domains.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

MACReD is a multi-agent collaborative reasoning framework for reaction diagram parsing that reports state-of-the-art F1 scores of 75.2% and 84.6% on the RxnScribe benchmark.

Inference Time Optimization with Confidence Dynamics

cs.CL · 2026-05-24 · unverdicted · novelty 6.0

Correct reasoning traces exhibit positive confidence gain while incorrect traces show declining confidence, enabling CDG-based voting that boosts performance on AIME, HMMT and BRUMO benchmarks across multiple LLM architectures.
