Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Mixed citation behavior. Most common role is method (43%).
abstract
Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates 'attention sink' and enhances long-context extrapolation performance. We also release related code (https://github.com/qiuzh20/gated_attention) and models (https://huggingface.co/QwQZh/gated_attention) to facilitate future research.
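As an illustration of the mechanism the abstract describes, the following is a minimal pure-Python sketch of a single attention head whose SDPA output is multiplied elementwise by a sigmoid gate computed from the same token's input. It is not the authors' implementation: the function names, the gate projection `Wg`, the causal masking, and the choice to compute the gate from the layer input are assumptions made for this example. The released code linked in the abstract is the reference.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sdpa_head(queries, keys, values):
    """Causal scaled dot-product attention for one head."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for t, q in enumerate(queries):
        # Token t attends only to positions 0..t.
        scores = [scale * sum(a * b for a, b in zip(q, keys[s]))
                  for s in range(t + 1)]
        weights = softmax(scores)
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs


def gated_attention_head(xs, Wq, Wk, Wv, Wg):
    """Return (gated, ungated) outputs for one head.

    xs: list of input vectors, one per token.
    Wq, Wk, Wv: projection matrices for this head (rows = head dim).
    Wg: hypothetical gate projection owned by this head (an assumption
        of this sketch, not a name taken from the paper).
    """
    queries = [matvec(Wq, x) for x in xs]
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    ungated = sdpa_head(queries, keys, values)
    gated = []
    for x, y in zip(xs, ungated):
        # The gate depends only on the current (query) token,
        # with one sigmoid value per output channel of this head.
        gate = [sigmoid(g) for g in matvec(Wg, x)]
        gated.append([g * yi for g, yi in zip(gate, y)])
    return gated, ungated
```

Under these assumptions, a multi-head layer would call `gated_attention_head` once per head with that head's own `Wg`, then concatenate the results before the output projection. Because every gate value lies between 0 and 1, a head can scale its contribution toward zero for a given query token, which is one way to picture the query-dependent sparsity mentioned in the abstract.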
citing papers explorer
-
GIANTS: Generative Insight Anticipation from Scientific Literature
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
-
NuGNN: a Graph Neural Network for Nuclear Reaction Network Equations
NuGNN applies a heterogeneous graph neural network to surrogate-solve a 690-isotope nuclear reaction network, achieving few-percent errors and reproducing final abundances where fully connected and Res-U-Net models fail.
-
Dynamics of Stochastic Momentum with Sparse Updates in High Dimensions
Characterizes high-dimensional phase structure of momentum under sparse updates via closed-form second-moment dynamics, with regimes matching SGD, unstable, or heavy-ball depending on retention-to-learning timescale ratio.
-
A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
-
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.
-
Degradation-Aware Adaptive Context Gating for Unified Image Restoration
DACG-IR adds a lightweight degradation-aware module that generates prompts to adaptively gate attention temperature, output features, and spatial-channel fusion in an encoder-decoder network for unified image restoration.
-
TokenFormer: Unify the Multi-Field and Sequential Recommendation Worlds
TokenFormer unifies multi-field and sequential recommendation modeling via bottom-full-top-sliding attention and non-linear interaction representations to avoid sequential collapse and deliver state-of-the-art performance.
-
Gradient Boosting within a Single Attention Layer
Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over standard attention.
-
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
-
Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation
GDLA delivers state-of-the-art accuracy on CT, MRI, ultrasound and dermoscopy segmentation benchmarks while keeping linear O(N) complexity in a PVT encoder-decoder.
-
Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns
TRC² is a brain-inspired decoder-only architecture that localizes fast plasticity and uses thalamic and hippocampal pathways to substantially reduce cumulative forgetting in sequential language model training on streams like C4, WikiText-103, and GSM8K.
-
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
-
Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.
-
Enhancing Multilingual Reasoning via Steerable Model Merging
ST-Merge uses gated cross-attention to adaptively weight source models during merging, outperforming baselines on multilingual reasoning tasks across 21 languages.
-
Contribution Weights: A Geometrical Analysis of Self-Attention Transformers
Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.
-
Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor
Empirical update to prior work shows most of 20 recent Transformer modifications do not transfer at 1-3B scales when measured with downstream CLIMB-12 tasks, multi-seed noise floor, and cross-scale stability.
-
Registers Matter for Pixel-Space Diffusion Transformers
Register tokens enhance pixel-space DiT training and output quality via cleaner high-noise feature maps, and a dual-stream design adds further gains with little overhead.
-
RigidFormer: Learning Rigid Dynamics using Transformers
RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.
-
GEM: Generating LiDAR World Model via Deformable Mamba
GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.
-
The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training convergence.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
HELIX: Hybrid Encoding with Learnable Identity and Cross-dimensional Synthesis for Time Series Imputation
HELIX uses learnable feature identities and hybrid temporal-feature attention to achieve state-of-the-art time series imputation across multiple datasets and settings.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models
SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
-
LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows
LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.
-
Attention to Mamba: A Recipe for Cross-Architecture Distillation
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
-
AgenticRS-Architecture: System Design for Agentic Recommender Systems
AutoModel uses three core agents (AutoTrain, AutoFeature, AutoPerf) connected by a shared coordination layer to automate model design, feature evolution, performance management, and paper-driven reproduction in large-scale recommender systems.
-
SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation
SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.
-
SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm
SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.
-
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test
-
HoloMotion-1 Technical Report
HoloMotion-1 trains a MoE Transformer policy on hybrid video and MoCap motion data to achieve robust zero-shot tracking that transfers directly to real humanoid robots.
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.
-
Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
-
Heterogeneous Scientific Foundation Model Collaboration
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
-
When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer
DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.
-
Gated Memory Policy
GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
-
LACE: Lattice Attention for Cross-thread Exploration
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
-
MiMo-V2-Flash Technical Report
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.
-
TTT3R: 3D Reconstruction as Test-Time Training
TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.
-
GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction
GLACIER combines graph, SMILES, and descriptor encoders with Finsler fusion and contrastive distillation to produce an efficient multimodal model for molecular property prediction.
-
Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models
Sigmoid attention replaces softmax in single-cell foundation models to deliver better representations, faster training, and stability, backed by bounded derivatives, diagonal Jacobian, and a new efficient GPU kernel.
-
Joint Model Parameter Scaling and Universal-Domain Data Integration for E-commerce Search Ranking
UniScale couples entire-space data construction with a hierarchical fusion transformer to improve scaling behavior and deliver 1.70% purchase and 2.04% GMV lifts in large-scale e-commerce search A/B tests.
-
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.
-
Learning-Based Spectrum Cartography in Low Earth Orbit Satellite Networks: An Overview
The paper overviews attention-based learning methods for spectrum cartography in LEO satellite networks to enable adaptive fusion of heterogeneous measurements for inference and resource allocation.
-
A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma
AI incorporating active precision from pyramidal neurons may reduce information overload by evaluating evidence coherence before attention rather than maximizing rewards.