Recognition: 2 theorem links · Lean Theorem
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Pith reviewed 2026-05-12 08:59 UTC · model grok-4.3
The pith
A head-specific sigmoid gate after scaled dot-product attention improves large language model performance, training stability, and long-context handling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our central finding is that a simple modification—applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA)—consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates 'attention sink' and enhances long-context extrapolation performance.
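To fix ideas, here is a minimal sketch of what "a head-specific sigmoid gate after SDPA" could look like in PyTorch, assuming the gate is computed from the same hidden state as the query (hence query-dependent) with one gate value per head per position. The class and parameter names (GatedAttention, gate_proj) and the per-head granularity are illustrative assumptions, not the authors' released implementation; the code and models linked in the abstract are the authoritative reference.

```python
# Illustrative sketch of head-specific sigmoid gating applied after SDPA,
# based on the abstract's description; not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Head-specific gate: one value per head per position, computed from
        # the same hidden state as the query (query-dependent).
        self.gate_proj = nn.Linear(d_model, n_heads)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Standard causal scaled dot-product attention (PyTorch >= 2.0).
        attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Sigmoid gate applied *after* SDPA: non-linearity plus
        # query-dependent, head-specific (soft) sparsity.
        gate = torch.sigmoid(self.gate_proj(x))          # (b, t, n_heads)
        attn_out = attn_out * gate.transpose(1, 2).unsqueeze(-1)
        out = attn_out.transpose(1, 2).reshape(b, t, d)
        return self.o_proj(out)
```

Because the gate sits between the attention output and the output projection, a gate value near zero lets a head effectively opt out of contributing for a given query, which is one reading of how query-dependent sparsity could remove the need for an attention sink.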
What carries the argument
head-specific sigmoid gate applied after scaled dot-product attention, introducing non-linearity and query-dependent sparsity
If this is right
- Training runs become more stable and tolerate larger learning rates without divergence.
- Models exhibit improved scaling trends when trained on larger datasets and bigger parameter counts.
- Attention sink is reduced, yielding better performance on long sequences without additional positional fixes (a sink-rate probe is sketched after this list).
- The gains hold for both dense models and mixture-of-experts architectures.
- The advantage should be specific to this gating position and form, holding up when compared against other placements such as gates applied before the attention computation.
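Since the attention-sink claim in this list is empirical, a hedged sketch of how the effect is commonly quantified may help: measure the fraction of post-softmax attention mass each head places on the first (typically BOS) token. The function name and the first-token convention are assumptions drawn from the broader sink literature, not necessarily the paper's exact metric.

```python
# Rough sink-rate probe: average attention mass a head places on the first
# token. Metric convention assumed, not taken verbatim from the paper.
import torch

def sink_rate(attn_weights: torch.Tensor) -> torch.Tensor:
    """attn_weights: (batch, heads, q_len, k_len) post-softmax weights."""
    mass_on_first = attn_weights[..., 0]      # (batch, heads, q_len)
    return mass_on_first.mean(dim=(0, 2))     # mean sink mass per head

# Usage idea: hook the softmax outputs of a vanilla and a gated layer during
# one forward pass and compare sink_rate(...) head by head.
```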
Where Pith is reading between the lines
- The query-dependent sparsity might be combined with token-pruning methods to lower inference cost on very long inputs.
- The added non-linearity could be tested in linear-attention or state-space models to check whether similar gains appear outside softmax attention.
- Releasing the code and models allows direct replication and extension to new architectures or training regimes.
- Future scaling studies could measure whether the improved scaling slope persists at even larger model sizes.
Load-bearing premise
That the performance gains and attention-sink mitigation arise specifically from the non-linearity and query-dependent sparsity of the sigmoid gate rather than from the added parameters or other uncontrolled factors in the 30-variant experiments.
What would settle it
Train matched models that keep the same parameter count but replace the sigmoid gate with a linear function or make the gate query-independent; check whether the performance, stability, and sink-mitigation advantages disappear.
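A sketch of what those matched controls could look like, assuming the illustrative gated-attention wiring above: swap the sigmoid for an identity (linear gate, keeps the parameters and the query dependence, drops the non-linearity), or compute the gate from a learned constant (query-independent, keeps the non-linearity, drops the query dependence). Names and wiring are hypothetical.

```python
# Hypothetical controls for the attribution claim; parameter counts stay
# close to the gated model so raw capacity cannot explain any gap.
import torch
import torch.nn as nn

def linear_gate(hidden: torch.Tensor, gate_proj: nn.Linear) -> torch.Tensor:
    # Same projection, no sigmoid: removes the non-linearity while keeping
    # the extra parameters and the query dependence.
    return gate_proj(hidden)

class QueryIndependentGate(nn.Module):
    # Gate depends only on a learned per-head bias, not on the query token:
    # keeps the sigmoid non-linearity, removes query-dependent sparsity.
    def __init__(self, n_heads: int):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(n_heads))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        b, t, _ = hidden.shape
        return torch.sigmoid(self.bias).view(1, 1, -1).expand(b, t, -1)

# If performance, stability at high learning rates, and sink mitigation all
# survive under either control, the attribution to that specific factor
# (non-linearity or query dependence) would be undermined.
```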
read the original abstract
Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates 'attention sink' and enhances long-context extrapolation performance, and we also release related codes (https://github.com/qiuzh20/gated_attention) and models (https://huggingface.co/QwQZh/gated_attention) to facilitate future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA) in softmax attention consistently improves performance in large language models. This is supported by training 30 variants of 15B MoE and 1.7B dense models on 3.5 trillion tokens, showing benefits in performance, stability, learning rate tolerance, and scaling. The effectiveness is attributed to non-linearity on the low-rank mapping and query-dependent sparse gating, which mitigates attention sinks and improves long-context performance. Code and models are released.
Significance. The large-scale empirical evaluation across model scales and architectures provides substantial support for a simple modification to the attention mechanism. The release of code and models aids reproducibility. If the gains are specifically due to the claimed non-linearity and sparsity rather than incidental factors, this could influence future LLM designs by improving stability and long-context capabilities with minimal overhead.
major comments (1)
- [Ablation studies and variant comparisons] The paper compares gating positions and computational variants to attribute gains to non-linearity and query-dependent sparsity. However, without explicit parameter-count-matched or FLOPs-matched baselines (e.g., adding dummy learnable parameters or fixed scalers to vanilla SDPA), the attribution remains open to the possibility that improvements arise from added capacity or other uncontrolled aspects of the 30-variant setup rather than the specific mechanisms. This is load-bearing for the central attribution claim.
minor comments (2)
- [Abstract] The abstract would benefit from briefly noting the exact baselines, any statistical tests, and key ablation controls to better convey the experimental rigor upfront.
- [Figures] Ensure all figures clearly label the 30 variants and include error bars or multiple runs where performance differences are reported.
Simulated Author's Rebuttal
Thank you for the detailed review and the recommendation for minor revision. We address the major comment regarding ablation studies and variant comparisons below.
read point-by-point responses
-
Referee: The paper compares gating positions and computational variants to attribute gains to non-linearity and query-dependent sparsity. However, without explicit parameter-count-matched or FLOPs-matched baselines (e.g., adding dummy learnable parameters or fixed scalers to vanilla SDPA), the attribution remains open to the possibility that improvements arise from added capacity or other uncontrolled aspects of the 30-variant setup rather than the specific mechanisms. This is load-bearing for the central attribution claim.
Authors: We thank the referee for highlighting this important point on controlling for model capacity. Our 30 variants include multiple gating positions (pre- and post-SDPA) and computational forms (e.g., different ways to compute the gate), all of which introduce comparable numbers of additional parameters. Notably, only the post-SDPA head-specific sigmoid consistently yields improvements across metrics, while other variants with similar parameter overhead do not. This differential effect supports that the gains stem from the specific non-linearity and query-dependent sparsity rather than capacity alone. Additionally, the observed benefits in training stability, higher learning rate tolerance, and mitigation of attention sinks are difficult to explain by parameter count increases alone. Nevertheless, to further strengthen the claim, we will add parameter-matched baselines using dummy learnable parameters or fixed scalers in the revised version, at least for the smaller 1.7B dense model scale where retraining is more feasible.
revision: yes
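As a concrete illustration of the baseline the authors agree to add, a parameter-matched "fixed scaler" could be a learnable per-head scalar applied to the SDPA output with no input dependence at all; the name and placement are hypothetical, sketched only to show what "added parameters without the gating mechanism" means.

```python
# Parameter-matched control: each head's SDPA output is scaled by a learnable
# constant that never sees the input, so any advantage of the sigmoid gate
# beyond this baseline cannot be explained by extra capacity alone.
import torch
import torch.nn as nn

class FixedHeadScaler(nn.Module):
    def __init__(self, n_heads: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(n_heads))

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, n_heads, seq_len, d_head)
        return attn_out * self.scale.view(1, -1, 1, 1)
```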
Circularity Check
No circularity: empirical results from direct model training and variant comparisons.
full rationale
The paper presents no derivation chain, first-principles prediction, or mathematical reduction. Its central claims—that a head-specific sigmoid gate after SDPA improves performance, stability, LR tolerance, and long-context behavior—are supported solely by training 30 variants of 15B MoE and 1.7B dense models on 3.5T tokens and measuring outcomes. Attribution to non-linearity and query-dependent sparsity is made by comparing gating positions and computational forms within the same experimental setup; these are independent empirical tests, not tautologies or fits renamed as predictions. No self-citation is load-bearing for the results, and no equation or claim reduces to its own inputs by construction. The work is self-contained against external benchmarks via released code and models.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard scaled dot-product attention and mixture-of-experts training procedures function as described in prior literature.
Forward citations
Cited by 33 Pith papers
-
GIANTS: Generative Insight Anticipation from Scientific Literature
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
-
A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.
-
A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
-
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...
-
Degradation-Aware Adaptive Context Gating for Unified Image Restoration
DACG-IR adds a lightweight degradation-aware module that generates prompts to adaptively gate attention temperature, output features, and spatial-channel fusion in an encoder-decoder network for unified image restoration.
-
TokenFormer: Unify the Multi-Field and Sequential Recommendation Worlds
TokenFormer unifies multi-field and sequential recommendation modeling via bottom-full-top-sliding attention and non-linear interaction representations to avoid sequential collapse and deliver state-of-the-art performance.
-
Gradient Boosting within a Single Attention Layer
Gradient-boosted attention applies a corrective second attention pass within a single layer, mapping to Friedman's gradient boosting and improving perplexity by 5.6-6.0% on WikiText-103 and OpenWebText subsets over st...
-
RigidFormer: Learning Rigid Dynamics using Transformers
RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.
-
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...
-
GEM: Generating LiDAR World Model via Deformable Mamba
GEM is a new LiDAR world model using deformable Mamba that disentangles dynamic and static features to generate high-fidelity simulations and achieve state-of-the-art results on autonomous driving benchmarks.
-
The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training...
-
Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
HELIX: Hybrid Encoding with Learnable Identity and Cross-dimensional Synthesis for Time Series Imputation
HELIX uses learnable feature identities and hybrid temporal-feature attention to achieve state-of-the-art time series imputation across multiple datasets and settings.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models
SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.
-
LACE: Lattice Attention for Cross-thread Exploration
LACE enables parallel reasoning paths in LLMs to communicate via lattice attention and error-correct using synthetic training data, improving accuracy by over 7 points over standard parallel search.
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
-
LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows
LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.
-
Attention to Mamba: A Recipe for Cross-Architecture Distillation
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
Let ViT Speak: Generative Language-Image Pre-training
GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
-
Heterogeneous Scientific Foundation Model Collaboration
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
-
When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer
DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.
-
Gated Memory Policy
GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
-
LACE: Lattice Attention for Cross-thread Exploration
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
-
LACE: Lattice Attention for Cross-thread Exploration
LACE adds lattice attention to let parallel LLM reasoning threads interact and correct errors, raising accuracy over 7 points versus standard independent sampling.
-
MiMo-V2-Flash Technical Report
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...
-
Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models
Sigmoid attention replaces softmax in single-cell foundation models to deliver better representations, faster training, and stability, backed by bounded derivatives, diagonal Jacobian, and a new efficient GPU kernel.
-
Learning-Based Spectrum Cartography in Low Earth Orbit Satellite Networks: An Overview
The paper overviews attention-based learning methods for spectrum cartography in LEO satellite networks to enable adaptive fusion of heterogeneous measurements for inference and resource allocation.
-
A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma
AI incorporating active precision from pyramidal neurons may reduce information overload by evaluating evidence coherence before attention rather than maximizing rewards.