Pith · machine review for the scientific record

arxiv: 2505.06708 · v1 · submitted 2025-05-10 · 💻 cs.CL

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Bo Zheng, Dayiheng Liu, Fei Huang, Jingren Zhou, Junyang Lin, Kaiyue Wen, Le Yu, Rui Men, Songlin Yang, Suozhi Huang, Zekun Wang, Zeyu Huang, Zihan Qiu

Pith reviewed 2026-05-12 08:59 UTC · model grok-4.3

classification 💻 cs.CL
keywords gated attention · softmax attention · attention sink · large language models · mixture-of-experts · non-linearity · sparsity · long-context extrapolation

The pith

A head-specific sigmoid gate after scaled dot-product attention improves large language model performance, training stability, and long-context handling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts extensive experiments on gating variants for softmax attention in large models, comparing 30 configurations across 15B mixture-of-experts and 1.7B dense models trained on 3.5 trillion tokens. It finds that inserting a simple head-specific sigmoid gate directly after the standard scaled dot-product attention step produces consistent gains in final model quality. These changes also make training more stable, allow higher learning rates, and yield better scaling behavior. The authors link the improvements to the gate adding non-linearity after the low-rank attention computation and imposing query-dependent sparse modulation on the outputs. This sparse modulation further reduces attention sink effects and supports stronger long-context extrapolation.

Core claim

Our central finding is that a simple modification—applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA)—consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates 'attention sink' and enhances long-context extrapolation performance.

What carries the argument

head-specific sigmoid gate applied after scaled dot-product attention, introducing non-linearity and query-dependent sparsity
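The mechanism can be sketched in a few lines of numpy for a single head. This is a minimal illustration, not the authors' exact parameterization: the weight names (`Wq`, `Wk`, `Wv`, `Wg`) and the choice to compute the gate from the pre-attention hidden state (which is what makes it query-dependent) are assumptions consistent with the paper's description.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_attention_head(x, Wq, Wk, Wv, Wg):
    """Single attention head with a sigmoid gate applied after SDPA.

    x: (seq, d) hidden states. Wg and the gate's input are
    illustrative choices; the gate is query-dependent because it is
    computed from the same hidden state that produces the query.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # causal mask: token i attends only to positions <= i
    scores = np.where(np.triu(np.ones_like(scores, dtype=bool), k=1), -1e9, scores)
    sdpa_out = softmax(scores) @ v        # standard scaled dot-product attention
    gate = sigmoid(x @ Wg)                # head-specific, query-dependent gate in (0, 1)
    return gate * sdpa_out                # elementwise modulation of the SDPA output
```

Because the gate sits after the softmax-weighted sum, it adds non-linearity on top of the low-rank value/output mapping and can push individual output channels toward zero, which is the sparsity the paper appeals to.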

If this is right

  • Training runs become more stable and tolerate larger learning rates without divergence.
  • Models exhibit improved scaling trends when trained on larger datasets and bigger parameter counts.
  • Attention sink is reduced, yielding better performance on long sequences without additional positional fixes.
  • The gains hold for both dense models and mixture-of-experts architectures.
  • The post-SDPA gate placement should outperform alternatives, such as gating before the attention computation, when compared head-to-head.
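The attention-sink bullet above can be made concrete with a simple diagnostic: the fraction of attention mass that later queries place on the first token. This is one common operationalization of "attention sink", not necessarily the paper's exact metric.

```python
import numpy as np

def sink_mass(attn):
    """Mean attention mass that queries after position 0 assign to token 0.

    attn: (seq, seq) row-stochastic attention matrix for one head.
    High values indicate a pronounced sink on the first token. A common
    operationalization; the paper's exact measurement may differ.
    """
    return float(attn[1:, 0].mean())
```

A head that dumps ~90% of its mass on token 0 scores near 0.9; a head that spreads mass uniformly over the causal window scores far lower.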

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The query-dependent sparsity might be combined with token-pruning methods to lower inference cost on very long inputs.
  • The added non-linearity could be tested in linear-attention or state-space models to check whether similar gains appear outside softmax attention.
  • Releasing the code and models allows direct replication and extension to new architectures or training regimes.
  • Future scaling studies could measure whether the improved scaling slope persists at even larger model sizes.

Load-bearing premise

That the performance gains and attention-sink mitigation arise specifically from the non-linearity and query-dependent sparsity of the sigmoid gate rather than from the added parameters or other uncontrolled factors in the 30-variant experiments.

What would settle it

Train matched models that keep the same parameter count but replace the sigmoid gate with a linear function or make the gate query-independent; check whether the performance, stability, and sink-mitigation advantages disappear.
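A minimal sketch of that control, assuming all three variants are allocated identical parameter shapes so capacity is held fixed and only the non-linearity and the query dependence differ. The names and the exact gate parameterization are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_variants(x, Wg, b):
    """Parameter-matched gate variants for the ablation described above.

    x: (seq, d) hidden states; Wg: (d, d); b: (d,). Every variant is
    allocated the same parameters, so any performance gap between them
    cannot be explained by added capacity alone.
    """
    zero_q = np.zeros_like(x)  # query-independent path keeps Wg and b allocated
    return {
        "sigmoid_query_dep": sigmoid(x @ Wg + b),        # the paper's gate
        "linear_query_dep": x @ Wg + b,                  # same inputs, non-linearity removed
        "sigmoid_query_indep": sigmoid(zero_q @ Wg + b), # learned but constant per channel
    }
```

If the advantages vanish under the linear or query-independent variants, the attribution to non-linearity and query-dependent sparsity holds; if they persist, added capacity or some other uncontrolled factor is doing the work.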

read the original abstract

Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates 'attention sink' and enhances long-context extrapolation performance, and we also release related code (https://github.com/qiuzh20/gated_attention) and models (https://huggingface.co/QwQZh/gated_attention) to facilitate future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA) in softmax attention consistently improves performance in large language models. This is supported by training 30 variants of 15B MoE and 1.7B dense models on 3.5 trillion tokens, showing benefits in performance, stability, learning rate tolerance, and scaling. The effectiveness is attributed to non-linearity on the low-rank mapping and query-dependent sparse gating, which mitigates attention sinks and improves long-context performance. Code and models are released.

Significance. The large-scale empirical evaluation across model scales and architectures provides substantial support for a simple modification to the attention mechanism. The release of code and models aids reproducibility. If the gains are specifically due to the claimed non-linearity and sparsity rather than incidental factors, this could influence future LLM designs by improving stability and long-context capabilities with minimal overhead.

major comments (1)
  1. [Ablation studies and variant comparisons] The paper compares gating positions and computational variants to attribute gains to non-linearity and query-dependent sparsity. However, without explicit parameter-count-matched or FLOPs-matched baselines (e.g., adding dummy learnable parameters or fixed scalers to vanilla SDPA), the attribution remains open to the possibility that improvements arise from added capacity or other uncontrolled aspects of the 30-variant setup rather than the specific mechanisms. This is load-bearing for the central attribution claim.
minor comments (2)
  1. [Abstract] The abstract would benefit from briefly noting the exact baselines, any statistical tests, and key ablation controls to better convey the experimental rigor upfront.
  2. [Figures] Ensure all figures clearly label the 30 variants and include error bars or multiple runs where performance differences are reported.

Simulated Author's Rebuttal

1 responses · 0 unresolved

Thank you for the detailed review and the recommendation for minor revision. We address the major comment regarding ablation studies and variant comparisons below.

read point-by-point responses
  1. Referee: The paper compares gating positions and computational variants to attribute gains to non-linearity and query-dependent sparsity. However, without explicit parameter-count-matched or FLOPs-matched baselines (e.g., adding dummy learnable parameters or fixed scalers to vanilla SDPA), the attribution remains open to the possibility that improvements arise from added capacity or other uncontrolled aspects of the 30-variant setup rather than the specific mechanisms. This is load-bearing for the central attribution claim.

    Authors: We thank the referee for highlighting this important point on controlling for model capacity. Our 30 variants include multiple gating positions (pre- and post-SDPA) and computational forms (e.g., different ways to compute the gate), all of which introduce comparable numbers of additional parameters. Notably, only the post-SDPA head-specific sigmoid consistently yields improvements across metrics, while other variants with similar parameter overhead do not. This differential pattern supports the claim that the gains stem from the specific non-linearity and query-dependent sparsity rather than capacity alone. Additionally, the observed benefits in training stability, higher learning rate tolerance, and mitigation of attention sinks are difficult to explain by parameter count increases alone. Nevertheless, to further strengthen the claim, we will add parameter-matched baselines using dummy learnable parameters or fixed scalers in the revised version, at least for the smaller 1.7B dense model scale where retraining is more feasible. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from direct model training and variant comparisons.

full rationale

The paper presents no derivation chain, first-principles prediction, or mathematical reduction. Its central claims—that a head-specific sigmoid gate after SDPA improves performance, stability, LR tolerance, and long-context behavior—are supported solely by training 30 variants of 15B MoE and 1.7B dense models on 3.5T tokens and measuring outcomes. Attribution to non-linearity and query-dependent sparsity is made by comparing gating positions and computational forms within the same experimental setup; these are independent empirical tests, not tautologies or fits renamed as predictions. No self-citation is load-bearing for the results, and no equation or claim reduces to its own inputs by construction. The work is self-contained against external benchmarks via released code and models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is empirical and introduces no new free parameters, no ad-hoc axioms beyond standard transformer training assumptions, and no invented entities.

axioms (1)
  • [standard math] Standard scaled dot-product attention and mixture-of-experts training procedures function as described in prior literature
    The experiments build directly on established transformer and MoE implementations without re-deriving them.

pith-pipeline@v0.9.0 · 5591 in / 1215 out tokens · 36176 ms · 2026-05-12T08:59:17.258236+00:00 · methodology

discussion (0)


Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GIANTS: Generative Insight Anticipation from Scientific Literature

    cs.CL 2026-04 unverdicted novelty 8.0

    GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

  2. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.

  3. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

  4. FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...

  5. Degradation-Aware Adaptive Context Gating for Unified Image Restoration

    cs.CV 2026-05 unverdicted novelty 7.0

    DACG-IR adds a lightweight degradation-aware module that generates prompts to adaptively gate attention temperature, output features, and spatial-channel fusion in an encoder-decoder network for unified image restoration.

  6. TokenFormer: Unify the Multi-Field and Sequential Recommendation Worlds

    cs.IR 2026-04 unverdicted novelty 7.0

    TokenFormer unifies multi-field and sequential recommendation modeling via bottom-full-top-sliding attention and non-linear interaction representations to avoid sequential collapse and deliver state-of-the-art performance.

  7. Gradient Boosting within a Single Attention Layer

    cs.LG 2026-04 conditional novelty 7.0

    Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over st...

  8. RigidFormer: Learning Rigid Dynamics using Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.

  9. SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

    cs.LG 2026-05 unverdicted novelty 6.0

    Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...

  10. GEM: Generating LiDAR World Model via Deformable Mamba

    cs.CV 2026-05 unverdicted novelty 6.0

    GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.

  11. The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

    cs.LG 2026-05 unverdicted novelty 6.0

    Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training...

  12. Cubit: Token Mixer with Kernel Ridge Regression

    cs.LG 2026-05 unverdicted novelty 6.0

    Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.

  13. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  14. HELIX: Hybrid Encoding with Learnable Identity and Cross-dimensional Synthesis for Time Series Imputation

    cs.LG 2026-05 unverdicted novelty 6.0

    HELIX uses learnable feature identities and hybrid temporal-feature attention to achieve state-of-the-art time series imputation across multiple datasets and settings.

  15. Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

    cs.CL 2026-04 unverdicted novelty 6.0

    HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

  16. SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models

    cs.LG 2026-04 unverdicted novelty 6.0

    SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.

  17. LACE: Lattice Attention for Cross-thread Exploration

    cs.AI 2026-04 unverdicted novelty 6.0

    LACE enables parallel reasoning paths in LLMs to communicate via lattice attention and error-correct using synthetic training data, improving accuracy by over 7 points over standard parallel search.

  18. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  19. Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

    cs.CL 2026-04 conditional novelty 6.0

    Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

  20. LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

    cs.CV 2026-04 conditional novelty 6.0

    LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

  21. Attention to Mamba: A Recipe for Cross-Architecture Distillation

    cs.CL 2026-04 unverdicted novelty 6.0

    A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.

  22. Kimi Linear: An Expressive, Efficient Attention Architecture

    cs.CL 2025-10 unverdicted novelty 6.0

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  23. Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

    cs.CL 2026-05 unverdicted novelty 5.0

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  24. Let ViT Speak: Generative Language-Image Pre-training

    cs.CV 2026-05 unverdicted novelty 5.0

    GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.

  25. Heterogeneous Scientific Foundation Model Collaboration

    cs.AI 2026-04 unverdicted novelty 5.0

    Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.

  26. When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

    cs.LG 2026-04 unverdicted novelty 5.0

    DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.

  27. Gated Memory Policy

    cs.RO 2026-04 unverdicted novelty 5.0

    GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.

  28. LACE: Lattice Attention for Cross-thread Exploration

    cs.AI 2026-04 unverdicted novelty 5.0

    LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.

  29. LACE: Lattice Attention for Cross-thread Exploration

    cs.AI 2026-04 unverdicted novelty 5.0

    LACE adds lattice attention to let parallel LLM reasoning threads interact and correct errors, raising accuracy over 7 points versus standard independent sampling.

  30. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

  31. Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models

    cs.LG 2026-04 unverdicted novelty 4.0

    Sigmoid attention replaces softmax in single-cell foundation models to deliver better representations, faster training, and stability, backed by bounded derivatives, diagonal Jacobian, and a new efficient GPU kernel.

  32. Learning-Based Spectrum Cartography in Low Earth Orbit Satellite Networks: An Overview

    cs.NI 2026-05 unverdicted novelty 3.0

    The paper overviews attention-based learning methods for spectrum cartography in LEO satellite networks to enable adaptive fusion of heterogeneous measurements for inference and resource allocation.

  33. A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma

    cs.AI 2026-05 unverdicted novelty 3.0

    AI incorporating active precision from pyramidal neurons may reduce information overload by evaluating evidence coherence before attention rather than maximizing rewards.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 30 Pith papers · 16 internal anchors
