arxiv: 2604.11321 · v1 · submitted 2026-04-13 · 💻 cs.NE

Winner-Take-All Spiking Transformer for Language Modeling

Chenlin Zhou, Sihang Guo, Jiaqi Wang, Dongyang Ma, Kaiwei Che, Baiyu Chen, Qingyan Meng, Zhengyu Ma, Yonghong Tian

Pith reviewed 2026-05-10 15:04 UTC · model grok-4.3

classification 💻 cs.NE

keywords spiking neural networks · transformer · winner-take-all · self-attention · language modeling · neuromorphic computing · energy efficiency · spiking transformers

The pith

Winner-take-all mechanisms let spiking transformers handle language modeling without softmax.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that winner-take-all selection can replace softmax in spiking self-attention, creating fully spike-driven modules that support both masked and causal language modeling. This removes a major energy bottleneck and convergence barrier that previously limited spiking transformers to vision tasks. The resulting encoder and decoder architectures train end-to-end and deliver competitive results across understanding, question answering, and reasoning benchmarks. If the approach holds, it makes the scalability of transformers compatible with the sparse, low-power operation of spiking networks. The work therefore points toward language models that run efficiently on neuromorphic hardware.

Core claim

The central claim is that winner-take-all spiking self-attention modules can perform the role of conventional attention without softmax normalization or dense computations. The authors introduce WTA Spiking Self-Attention (WSSA) for bidirectional contexts and Causal WTA Spiking Self-Attention (CWSSA) for autoregressive settings. These modules underpin the WE-Spikingformer for masked language modeling and the WD-Spikingformer for causal language modeling, both trained directly on text. Experiments on sixteen datasets confirm that the resulting models achieve usable performance on natural language understanding, question answering, and commonsense reasoning tasks.

What carries the argument

The Winner-Take-All Spiking Self-Attention (WSSA) module, which selects the strongest spiking signals to compute attention weights without softmax, together with its causal counterpart CWSSA.
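
As a rough illustration of the idea, the sketch below implements a generic Top-K winner-take-all attention over binary spike matrices in plain Python. It is not the authors' implementation: the function names `topk_wta` and `wta_spiking_attention`, the tie-breaking rule, and the omission of any scaling and of the spiking neuron layer that would normally follow are assumptions made for brevity. Setting `causal=True` restricts each position to itself and earlier positions, in the spirit of CWSSA.

```python
def topk_wta(scores, k, causal=False):
    """Return a binary mask with 1 at the k largest scores of each row.

    With causal=True, row i may only pick winners among columns j <= i.
    Ties are broken in favour of the lower column index (an assumption).
    """
    mask = []
    for i, row in enumerate(scores):
        n_visible = min(i + 1, len(row)) if causal else len(row)
        # Rank visible columns by descending score; sorted() is stable.
        winners = set(sorted(range(n_visible), key=lambda j: -row[j])[:k])
        mask.append([1 if j in winners else 0 for j in range(len(row))])
    return mask


def wta_spiking_attention(q, k_mat, v, k=1, causal=False):
    """Softmax-free attention over binary spike matrices (lists of 0/1 rows)."""
    # Q @ K^T on binary inputs counts coincident spikes, so scores are integers.
    scores = [[sum(a * b for a, b in zip(qi, kj)) for kj in k_mat] for qi in q]
    weights = topk_wta(scores, k, causal)
    # Binary weights turn weights @ V into a sum of the selected value rows.
    dim = len(v[0])
    return [[sum(v[j][d] for j, w in enumerate(wrow) if w) for d in range(dim)]
            for wrow in weights]


# Hypothetical toy input: 3 tokens, 3 spike channels for Q/K, 2 for V.
spikes = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
values = [[1, 0], [0, 1], [1, 1]]
output = wta_spiking_attention(spikes, spikes, values, k=1, causal=True)
```

With `k=1` this reduces to a hard WTA in which each query copies the value row of its single best-matching key. Because the attention weights are binary, the final product needs only additions, which is the property that makes a softmax-free design attractive for spike-driven hardware.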

If this is right

Spiking transformers become applicable to general language modeling rather than vision alone.
Energy use drops because both attention and activation are realized with sparse spikes and no softmax.
End-to-end training succeeds on text data, removing the need for separate pre-training or conversion steps.
The architectures are directly deployable on neuromorphic hardware for low-power NLP inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same WTA mechanism could be tested on other sequential domains such as audio or time-series forecasting.
Scaling laws for these models remain open; larger WTA spiking transformers might close the remaining performance gap with dense transformers.
Hybrid designs that mix WTA spiking layers with conventional layers could balance efficiency and accuracy for specific tasks.

Load-bearing premise

That winner-take-all selection among spikes can capture the long-range dependencies needed for language without the stabilizing effect of softmax normalization.

What would settle it

Train the WD-Spikingformer on a standard language-modeling corpus and measure test perplexity. If it remains substantially higher than that of a comparable non-spiking transformer baseline, the claim that WTA attention suffices would be refuted.
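
For concreteness, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that computation and a simple relative-gap measure; the function names and the idea of summarizing the comparison as a single fractional gap are illustrative assumptions, and no threshold for "substantially higher" is implied by the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities of the reference text."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def relative_gap(ppl_model, ppl_baseline):
    """Fractional perplexity excess of the model over the baseline (0.0 = parity)."""
    return (ppl_model - ppl_baseline) / ppl_baseline
```

A model that assigns probability 0.25 to every token has perplexity 4, which is a convenient sanity check before comparing real runs.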

Figures

Figures reproduced from arXiv: 2604.11321 by Baiyu Chen, Chenlin Zhou, Dongyang Ma, Jiaqi Wang, Kaiwei Che, Qingyan Meng, Sihang Guo, Yonghong Tian, Zhengyu Ma.

**Figure 1.** Overview of WTA Spiking Self-Attention (WSSA) and Causal WTA Spiking Self-Attention (CWSSA). The left shows the softmax-based spiking self-attention in SpikeLM (Xing et al., 2024b). The right shows our WSSA in WE-Spikingformer for masked language modeling and CWSSA in WD-Spikingformer for causal language modeling. The symbol “@” denotes matrix multiplication.

**Figure 2.** The overview of WE-Spikingformer and WD-Spikingformer. The left shows WE-Spikingformer (WTA-based Encoder-only Spiking Transformer) for spike-based masked language modeling. The right shows WD-Spikingformer (WTA-based Decoder-only Spiking Transformer) for spike-based causal language modeling.

**Figure 3.** Increasing model parameters for WE-Spikingformer pretraining on (a) Question-Answering Tasks (QAT), and (b) Commonsense Reasoning Tasks (CRT).

**Figure 4.** Increasing tokens for WE-Spikingformer-1.0B pretraining on (a) Question-Answering Tasks (QAT), and (b) Commonsense Reasoning Tasks (CRT). Accompanying text from the paper: "… to 0.5B tokens, the performance of WE-Spikingformer-1.0B achieves a significant performance improvement of 1.9% on QAT, demonstrating that WE-Spikingformer continues to significantly benefit from larger-scale pretraining."

Original abstract

Spiking Transformers, which combine the scalability of Transformers with the sparse, energy-efficient property of Spiking Neural Networks (SNNs), have achieved impressive results in neuromorphic and vision tasks and attracted increasing attention. However, existing directly trained spiking transformers primarily focus on vision tasks. For language modeling with spiking transformer, convergence relies heavily on softmax-based spiking self-attention, which incurs high energy costs and poses challenges for neuromorphic deployment. To address this issue, we introduce Winner-Take-All (WTA) mechanisms into spiking transformers and propose two novel softmax-free, spike-driven self-attention modules: WTA Spiking Self-Attention (WSSA) and Causal WTA Spiking Self-Attention (CWSSA). Based on them, we design WTA-based Encoder-only Spiking Transformer (WE-Spikingformer) for masked language modeling and WTA-based Decoder-only Spiking Transformer (WD-Spikingformer) for causal language modeling, systematically exploring softmax-free, spiking-driven Transformer architectures trained end-to-end for natural language processing tasks. Extensive experiments on 16 datasets spanning natural language understanding, question-answering tasks, and commonsense reasoning tasks validate the effectiveness of our approach and highlight the promise of spiking transformers for general language modeling and energy-efficient artificial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper makes WTA spiking attention work for both masked and causal language modeling, but the results section needs tighter baselines and numbers to show it closes the gap with standard transformers.

The core advance is replacing softmax with winner-take-all in spiking self-attention, giving two new modules (WSSA and CWSSA) that let the authors train encoder-only and decoder-only spiking transformers end-to-end on language tasks. That move from vision-only spiking transformers to NLP is the real novelty, and they back it with runs across 16 datasets covering NLU, QA, and commonsense reasoning. The energy angle is straightforward: no softmax means lower cost on neuromorphic hardware, which matters if these models ever move off GPUs. They also keep the spike-driven property throughout, which is consistent with the SNN goal. The architecture choices look reasonable on paper: causal masking for the decoder variant and the usual spiking neuron layers.

What is missing is any sense of how large the performance drop is versus BERT-style or GPT-style baselines, or versus earlier spiking transformers that stayed in vision. Without error bars, ablation tables on the WTA threshold, or direct energy measurements on actual neuromorphic chips, it is hard to judge whether the approach is competitive or just workable. Training stability without softmax normalization is the key assumption, and the abstract does not show whether they needed extra regularization or longer schedules to make it converge. The citation list seems light on recent spiking NLP work, so it is not clear how much they are building on versus starting fresh.

This is useful reading for anyone already running spiking networks or neuromorphic hardware experiments; a general NLP group would probably skip it unless the numbers turn out strong. It is worth sending to referees because the direction is clean and the empirical scope is broad, even if the current write-up needs more quantitative grounding to stand up.

Referee Report

0 major / 2 minor

Summary. The paper claims that incorporating Winner-Take-All (WTA) mechanisms into spiking transformers yields two novel softmax-free, spike-driven self-attention modules (WSSA and CWSSA). These enable the construction of an encoder-only WTA-based Spiking Transformer (WE-Spikingformer) for masked language modeling and a decoder-only variant (WD-Spikingformer) for causal language modeling. The approach is validated end-to-end on 16 datasets spanning natural language understanding, question answering, and commonsense reasoning, demonstrating competitive performance while addressing energy and deployment issues associated with softmax in spiking attention.

Significance. If the reported results hold, the work is significant for extending energy-efficient spiking neural networks from vision to general language modeling. It provides a concrete path toward neuromorphic deployment of transformers by removing softmax normalization, which is a load-bearing barrier for sparse, event-driven hardware. The systematic exploration of encoder-only and decoder-only spiking architectures for NLP tasks fills a noted gap in the literature.

minor comments (2)

[Abstract] The claim of validation across 16 datasets is stated without any quantitative metrics, baseline comparisons, or statistical details (e.g., accuracy deltas or standard deviations). This weakens the reader's ability to assess competitiveness from the abstract alone and should be augmented with at least one key result per task category.
[Method] The description of WSSA and CWSSA as 'spike-driven' and 'softmax-free' is clear in intent, but the manuscript would benefit from an explicit comparison (perhaps in a table or figure) of spike rates and energy estimates versus prior softmax-based spiking attention to quantify the claimed efficiency gains.
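
One conventional way to produce the kind of estimate requested in the second comment is to count operations: multiply-accumulates (MACs) for a dense layer versus spike-triggered accumulates (ACs) for a spiking one. The sketch below assumes the per-operation energies commonly quoted for 45 nm hardware (4.6 pJ per MAC, 0.9 pJ per AC). These constants, the function names, and the simple firing-rate model are illustrative assumptions, not figures or formulas taken from the paper.

```python
E_MAC_PJ = 4.6  # assumed energy of one 32-bit multiply-accumulate, in picojoules
E_AC_PJ = 0.9   # assumed energy of one 32-bit accumulate, in picojoules


def ann_energy_pj(macs):
    """Energy of a dense layer that performs `macs` multiply-accumulates."""
    return E_MAC_PJ * macs


def snn_energy_pj(macs, firing_rate, time_steps):
    """Energy of the spiking counterpart of the same layer.

    Only inputs that actually spike trigger an accumulate, so the number of
    synaptic operations scales with the mean firing rate and the time steps.
    """
    synaptic_ops = firing_rate * time_steps * macs
    return E_AC_PJ * synaptic_ops


def energy_ratio(macs, firing_rate, time_steps):
    """How many times more energy the dense layer uses than the spiking one."""
    return ann_energy_pj(macs) / snn_energy_pj(macs, firing_rate, time_steps)
```

Under this model the spiking layer wins only while `firing_rate * time_steps` stays below roughly 4.6 / 0.9, which is why reported spike rates matter as much as the removal of softmax.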

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. We appreciate the recognition that our WTA-based approach addresses key barriers to deploying spiking transformers in language modeling tasks.

Circularity Check

0 steps flagged

No significant circularity; empirical architectural proposal

The paper introduces WTA-based spiking self-attention modules (WSSA/CWSSA) as a direct architectural substitution for softmax in spiking transformers, then constructs encoder-only and decoder-only variants and validates them experimentally across 16 datasets. No derivation chain, equations, or first-principles predictions are presented that reduce to fitted inputs, self-definitions, or self-citation load-bearing steps. The central claims rest on the empirical performance of the proposed modules rather than any mathematical reduction to prior results or parameters, rendering the work self-contained as an engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the WTA mechanism is introduced as a design choice whose implementation details and any associated hyperparameters are not described.

pith-pipeline@v0.9.0 · 5542 in / 1109 out tokens · 94365 ms · 2026-05-10T15:04:35.012572+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adaptive Spiking Neurons for Vision and Language Modeling
cs.NE 2026-04 unverdicted novelty 5.0

ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.

Reference graph

Works this paper leans on

24 extracted references · 19 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

Agarap, A. F. Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375.
[2]

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.
[3]

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
[4]

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.
[5]

… doi: 10.1109/TCSI.2025.3549060.

Fang, Y., Zhou, D., Wang, Z., Ren, H., Zeng, Z., Li, L., Xu, R., et al. Spiking neural networks need high-frequency information. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[6]

Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
[7]

Lee, D., Sima, A., Li, Y., Stinis, P., and Panda, P. SpikePool: Event-driven spiking transformer with pooling attention. arXiv preprint arXiv:2510.12102.
[8]

Lv, C., Li, T., Xu, J., Gu, C., Ling, Z., Zhang, C., Zheng, X., and Huang, X. SpikeBERT: A language Spikformer learned from BERT with knowledge distillation. arXiv preprint arXiv:2308.15122.
[9]

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789.
[10]

Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., et al. RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048.
[11]

Qin, H., Ding, Y., Zhang, M., Yan, Q., Liu, A., Dang, Q., Liu, Z., and Liu, X. BiBERT: Accurate fully binarized BERT. arXiv preprint arXiv:2203.06390.
[12]

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
[13]

Vilares, D. and Gómez-Rodríguez, C. HEAD-QA: A healthcare dataset for complex reasoning. arXiv preprint arXiv:1906.04701.
[14]

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
[15]

Xing, X., Gao, B., Zhang, Z., Clifton, D. A., Xiao, S., Du, L., Li, G., and Zhang, J. SpikeLLM: Scaling up spiking neural network to large language models via saliency-based spiking. arXiv preprint arXiv:2407.04752, 2024a.

Xing, X., Zhang, Z., Ni, Z., Xiao, S., Ju, Y., Fan, S., Wang, Y., Zhang, J., and Li, G. SpikeLM: Towards general spike-driven langua...
[16]

Spike-driven transformer v2: Meta spiking neural network architecture inspiring the design of next-generation neuromorphic chips

Yao, M., Hu, J., Zhou, Z., Yuan, L., Tian, Y., Xu, B., and Li, G. Spike-driven transformer. Advances in Neural Information Processing Systems, 36:64043–64058, 2023a.

Yao, M., Zhao, G., Zhang, H., Hu, Y., Deng, L., Tian, Y., Xu, B., and Li, G. Attention spiking neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023b.

Yao,...
[17]

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
[18]

Zhang, W., Hou, L., Yin, Y., Shang, L., Chen, X., Jiang, X., and Liu, Q. TernaryBERT: Distillation-aware ultra-low bit BERT. arXiv preprint arXiv:2009.12812.
[19]

Zhou, C., Yu, L., Zhou, Z., Ma, Z., Zhang, H., Zhou, H., and Tian, Y. Spikingformer: Spike-driven residual learning for transformer-based spiking neural network. arXiv preprint arXiv:2304.11954, 2023a.

Zhou, C., Zhang, H., Zhou, Z., Yu, L., Huang, L., Fan, X., Yuan, L., Ma, Z., Zhou, H., and Tian, Y. QKFormer: Hierarchical spiking transformer using qk ...
[20]

SpikeGPT: Generative pre-trained language model with spiking neural networks

Zhou, Z., Zhu, Y., He, C., Wang, Y., Yan, S., Tian, Y., and Yuan, L. Spikformer: When spiking neural network meets transformer. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=frE4fUwz_h.

Zhu, R.-J., Zhao, Q., Li, G., and Eshraghian, J. K. SpikeGPT: Generative pre-trained language model ...
[21]

Appendix A. Version Statement. We have further refined and improved the paper, building upon the previous version available at https://openreview.net/forum?id=7PKGMNcM0w. B. Dataset Introduction. General Language Understanding Evaluation (GLUE). GLUE benchmark (Wang et al.,
[22]

is a coreference resolution benchmark designed to test commonsense reasoning by requiring models to resolve pronoun references that cannot be disambiguated by syntax alone. C. Energy consumption. SNNs replace traditional multiply-accumulate (MAC) operations with low-power accumulate (AC) operations. For ANNs, the overall energy consumption can be directly ...
[23]

NI-LIF significantly boosts training simulation speed compared to T-LIF, while maintaining comparable model performance.

Table 8. Spiking neuron comparison for WE-Spikingformer pretraining.

| Backbone | Neuron | T | Average accuracy (%) | Training time (hours) |
| --- | --- | --- | --- | --- |
| WE-Spikingformer | NI-LIF | 4 | 65.7 | 26 |
| WE-Spikingformer... | | | | |
[24]

In addition, Dong et al. (2025) reports that this Top-K mechanism achieves approximately an 8× improvement in energy efficiency over the standard Softmax layer. K. Additional Ablation Study for different WTAs and time steps on CRT. Table 11 shows the ablation study for different WTAs and time steps on CRT. The performance of Hard WTA, Top-K WTA (K=3), and...