arxiv: 2605.13859 · v1 · submitted 2026-04-14 · 💻 cs.NE · cs.AI· cs.LG

Recognition: 2 theorem links

· Lean Theorem

BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation

Sihang Guo , Chenlin Zhou , Jiaqi Wang , Kehai Chen , Qingyan Meng , Zhengyu Ma

Authors on Pith no claims yet

Pith reviewed 2026-05-15 07:06 UTC · model grok-4.3

classification 💻 cs.NE cs.AIcs.LG

keywords spiking neural networkscausal language modelbinary spikingsoftmax-free attentionalignment distillationenergy-efficient NLPMatMul-free model

0 comments

The pith

BiSpikCLM creates the first fully binary spiking causal language model that avoids all floating-point matrix multiplications and softmax.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that spiking neural networks can function as viable, low-energy replacements for conventional large language models in autoregressive text generation. It does so by building BiSpikCLM as a fully binary, MatMul-free architecture that replaces standard attention with Softmax-Free Spiking Attention and trains the spiking student via Spike-Aware Alignment Distillation that matches embeddings, attention maps, features, and logits to an ANN teacher. The resulting models reach competitive generation quality while consuming only 4.16 to 5.87 percent of the compute and far fewer training tokens than dense baselines. A reader should care because the work shows a concrete route to running capable language models on severely power-limited hardware without hidden floating-point operations at inference time.

Core claim

BiSpikCLM integrates Softmax-Free Spiking Attention to remove softmax and floating-point operations from autoregressive attention and applies Spike-Aware Alignment Distillation to align the binary spiking student with a standard ANN teacher across embeddings, attention maps, intermediate features, and output logits, thereby enabling the first fully binary spiking MatMul-free causal language model to attain competitive performance on natural language generation tasks at only 4.16 percent to 5.87 percent of conventional computational cost.

What carries the argument

Softmax-Free Spiking Attention (SFSA), which performs attention using only binary spikes without softmax or floating-point arithmetic, and Spike-Aware Alignment Distillation (SpAD), which performs multi-component alignment between ANN teacher and SNN student to support efficient training.

If this is right

The 1.3B-scale model reaches comparable performance after training on only 5.6 percent of the usual number of tokens.
All intensive floating-point matrix multiplications and nonlinearities are eliminated from the inference path.
Fully binary spike-driven language models become feasible without sacrificing autoregressive causal structure.
Multi-level distillation offers a practical route for scaling brain-inspired spiking NLP architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same SFSA and SpAD combination could be applied to non-language sequence tasks such as time-series forecasting.
If the efficiency ratio holds at larger scales, the models could run on neuromorphic chips with orders-of-magnitude lower energy per token.
Binary spike representations may allow further hardware-level optimizations such as event-driven memory access that the paper does not explore.
Combining the approach with additional compression methods could push the compute fraction even lower while preserving the fully spiking constraint.

Load-bearing premise

Spike-Aware Alignment Distillation can transfer knowledge from the floating-point ANN teacher to the binary SNN student across multiple layers without causing permanent capacity loss or requiring any floating-point operations during inference.

What would settle it

A BiSpikCLM model trained with the proposed SFSA and SpAD methods that exhibits substantially higher perplexity or lower generation quality on standard language-modeling benchmarks than a matched non-spiking baseline at identical scale would falsify the claim of competitive performance.

Figures

Figures reproduced from arXiv: 2605.13859 by Chenlin Zhou, Jiaqi Wang, Kehai Chen, Qingyan Meng, Sihang Guo, Zhengyu Ma.

**Figure 1.** Figure 1: Overview of Softmax-Free Spiking Attention (SFSA). Left: Comparison between Vanilla Causal Self-Attention (CSA) (bottom) and SFSA (top). CSA uses softmax and additive masks, while SFSA employs spike-based activation and binary causal masking. Right: Detailed SFSA pipeline, showing spike-form Q, K, V computation, masked integer attention, spiking activation, and spike-based output, enabling fully discrete a… view at source ↗

**Figure 2.** Figure 2: Left depicts the BiSpikCLM framework, detailing the operations of the Softmax-Free Spiking Attention (SFSA) module and the Spiking Feed-Forward Network (SFFN) module. Right compares the computational process of vanilla Causal Self-Attention (CSA), Spiking Self-Attention (SSA), and SFSA, where red spikes represent binary values of 1 and all other values are 0. rons with causal attention for softmax-free, en… view at source ↗

**Figure 3.** Figure 3: Overview of our Spike-Aware Alignment Distillation (SpAD) framework. Knowledge is transferred from a frozen ANN teacher to a trainable SNN student via five alignment modules: (1) Embedding Alignment (EA); (2) Spike-Attention Alignment (SAA); (3) Spike-Feature Alignment (SFA); (4) Soft-Target Alignment (STA); and (5) Hard-Target Alignment (HTA). Losses include MSE, CE, and spike-aware temporal strategies. I… view at source ↗

**Figure 4.** Figure 4: Visualization of ablation experiments. accuracy. BiSpikCLM consumes an order of magnitude less energy than ANN baselines (e.g., 9.43 mJ vs. 126.01 mJ at 125M) while achieving over 93% of the accuracy. Across 0.125B–1.3B parameters, it maintains competitive performance at only 4.16%–5.87% of the computational cost. Moreover, increasing time steps slightly improves performance (36.05% → 36.50% at 125M) with … view at source ↗

**Figure 5.** Figure 5: Weight distribution comparison between Artificial Neural Networks (ANNs) and Spiking Neural Networks (SNNs). ANNs typically exhibit a more concentrated weight distribution around zero, especially in early layers (e.g., Layer 0). In deeper layers (e.g., Layer 11), their weight distribution becomes slightly more spread out but remains relatively compact, indicating tightly clustered weights that contribute t… view at source ↗

**Figure 6.** Figure 6: Layer-wise firing rate heatmaps under different temporal resolutions. (a) At T = 2, firing activity is sparse and concentrated in a small subset of layers. (b) At T = 3, elevated firing rates begin to propagate across more layers, indicating increased temporal integration. (c) With T = 4, spiking activity becomes more distributed, engaging a larger portion of the network. (d) At T = 8, firing patterns are … view at source ↗

**Figure 7.** Figure 7: This figure presents the spiking activity across token positions for four representative layers (Layer 0, Layer 3, Layer 7, and Layer 11) at an early inference time step. The firing patterns are relatively sparse and uniformly distributed, particularly in the lower layers. This reflects the initial stage of neuronal processing, where the model begins encoding input signals with limited temporal context. No… view at source ↗

**Figure 8.** Figure 8: At T = 3, the firing distributions become slightly more structured across token positions and layers. While lower layers maintain broadly distributed activity, deeper layers begin to display early signs of selective activation. Compared to T = 2, this figure reveals the onset of temporal refinement, indicating that additional time steps allow the model to initiate more context-sensitive computation, partic… view at source ↗

**Figure 9.** Figure 9: With four time steps, the model exhibits more pronounced spatiotemporal differentiation in firing behavior. Activity becomes more variable across token positions, and certain regions in deeper layers start to display concentrated firing. This suggests that the network is engaging in increasingly specialized processing, distributing its computation more selectively based on both input semantics and accumula… view at source ↗

**Figure 10.** Figure 10: By T = 8, the firing patterns exhibit substantial temporal evolution and structural complexity. Deeper layers, in particular, show heightened and more focused activation for specific token regions, reflecting refined internal representations. This level of activity suggests that the model has transitioned into a more stable and semantically rich encoding phase. The marked increase in firing diversity and … view at source ↗

**Figure 11.** Figure 11: Training loss curves for the BiSpikCLM-1.3B model, comparing the 10B and 25B token training regimes. The smooth downward trend confirms stable convergence. Beyond quantitative metrics, we qualitatively evaluated the conversational abilities of our BiSpikCLM-1.3B model trained with 10B and 25B tokens [PITH_FULL_IMAGE:figures/full_fig_p029_11.png] view at source ↗

read the original abstract

Spiking Neural Networks (SNNs) offer promising energy-efficient alternatives to large language models (LLMs) due to their event-driven nature and ultra-low power consumption. However, to preserve capacity, most existing spiking LLMs still incur intensive floating-point matrix multiplication (MatMul) and nonlinearities, or training difficulties arising from the complex spatiotemporal dynamics. To address these challenges, we propose BiSpikCLM, the first fully binary spiking MatMul-free causal language model. BiSpikCLM introduces Softmax-Free Spiking Attention (SFSA), eliminating softmax and floating-point operations in autoregressive language modeling. For efficient training, we introduce Spike-Aware Alignment Distillation (SpAD), which aligns ANN teacher and SNN student across embeddings, attention maps, intermediate features, and output logits. SpAD framework allows BiSpikCLM to reach comparable performance to ANN counterparts using substantially fewer training tokens (e.g., only 5.6% of the tokens for the 1.3B model). As a result, BiSpikCLM achieves competitive performance at only 4.16% - 5.87% of the computational cost on natural language generation tasks. Our results highlight the feasibility and effectiveness of fully binary spike-driven LLMs and establish the distillation as a promising pathway for brain-inspired spiking NLP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BiSpikCLM shows a workable path to fully binary spiking LLMs via SFSA and SpAD, but the MatMul-free inference claim needs explicit verification that no FP ops leak through distillation.

read the letter

The main point is that this paper delivers the first claimed fully binary, MatMul-free spiking causal language model by replacing softmax attention with SFSA and using SpAD to train the student on far fewer tokens than usual. The 1.3B model reportedly matches ANN performance after seeing only 5.6% of the tokens, which would be a real efficiency win if it holds up at inference time with strictly binary operations. What stands out is the concrete engineering: SFSA removes floating-point nonlinearities in the attention path, and SpAD aligns embeddings, attention maps, features, and logits in a spike-aware way. That combination looks new relative to earlier spiking LLM attempts that still kept some dense MatMul or softmax. The reported 4-6% compute cost on generation tasks is the kind of number that matters for edge deployment. The soft spots are around verification. The abstract and results do not show ablations that isolate how much each alignment level in SpAD contributes, nor do they include error bars or multiple runs. More critically, it is not obvious from the description how attention-map and feature alignment during training avoids introducing hidden floating-point MatMuls or conversions that would survive into the final inference graph. If any such operation remains, both the cost reduction and the “first fully binary” label weaken. The low token budget also raises the practical question of whether capacity loss is fully recovered or just masked on the reported benchmarks. This work is aimed at researchers building neuromorphic or low-power NLP systems who already know the spiking literature. A reader who cares about energy numbers on real hardware would get value from the empirical claims once the implementation details are checked. The paper deserves a serious referee because the core idea is timely and the empirical direction is clear, even if the current write-up leaves the strict binary guarantee under-specified. I would send it out for review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes BiSpikCLM as the first fully binary spiking MatMul-free causal language model. It introduces Softmax-Free Spiking Attention (SFSA) to eliminate softmax and floating-point operations during autoregressive language modeling, along with Spike-Aware Alignment Distillation (SpAD) to align an ANN teacher with an SNN student across embeddings, attention maps, intermediate features, and output logits. This enables training with substantially fewer tokens (e.g., only 5.6% for the 1.3B model) while claiming competitive performance on natural language generation tasks at 4.16%-5.87% of the computational cost of conventional models.

Significance. If the claims of strict binarity, MatMul-freeness at inference, and robust empirical performance are substantiated with proper controls, this would constitute a meaningful advance in energy-efficient spiking neural networks for language modeling. The distillation approach that achieves comparable results with minimal tokens is a clear strength and could inform scalable training of brain-inspired NLP systems.

major comments (2)

[Abstract] Abstract: The central claim that BiSpikCLM is 'the first fully binary spiking MatMul-free causal language model' with competitive performance at 4.16%-5.87% computational cost rests on empirical results, yet the abstract supplies no error bars, ablation studies, or explicit verification that all operations remain strictly binary and MatMul-free throughout inference. This directly affects the soundness of the primary contribution.
[SpAD description] SpAD framework (as described in abstract): Alignment of attention maps, intermediate features, and logits between ANN teacher and SNN student typically requires dense floating-point computations; the manuscript must explicitly show that no such operations (e.g., normalization, output projections, or spike-to-float conversions) are present in the student inference path, as any leakage would invalidate both the cost-reduction numbers and the 'fully binary MatMul-free' designation.

minor comments (2)

[Abstract] Abstract: The computational-cost range (4.16%-5.87%) should reference specific tables or figures that break down the metric (FLOPs, energy, or spike counts) per task for reproducibility.
[Abstract] Abstract: Clarify whether 'computational cost' refers to training or inference and provide the exact baseline models used for the percentage comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments have prompted us to strengthen the presentation of our empirical validations and to clarify the separation between training and inference paths. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that BiSpikCLM is 'the first fully binary spiking MatMul-free causal language model' with competitive performance at 4.16%-5.87% computational cost rests on empirical results, yet the abstract supplies no error bars, ablation studies, or explicit verification that all operations remain strictly binary and MatMul-free throughout inference. This directly affects the soundness of the primary contribution.

Authors: We agree that the abstract would benefit from additional context. In the revised version we have added a sentence referencing the error bars and standard deviations reported across our main results (Table 2, Figure 3) and the ablation studies in Section 4.3. We have also inserted a brief clause confirming that inference uses only binary spike operations with no floating-point MatMul or softmax, directing readers to the operation-count analysis and pseudocode in Section 3.2 that verify the strict binarity and MatMul-freeness. revision: yes
Referee: [SpAD description] SpAD framework (as described in abstract): Alignment of attention maps, intermediate features, and logits between ANN teacher and SNN student typically requires dense floating-point computations; the manuscript must explicitly show that no such operations (e.g., normalization, output projections, or spike-to-float conversions) are present in the student inference path, as any leakage would invalidate both the cost-reduction numbers and the 'fully binary MatMul-free' designation.

Authors: We thank the referee for highlighting this distinction. SpAD is applied solely during training to align the SNN student with the ANN teacher; the student inference path is completely decoupled and uses only the SFSA module with binary spikes. In the revised manuscript we have expanded Section 3.4 with a dedicated inference flowchart and pseudocode that explicitly show the absence of normalization, dense projections, or any spike-to-float conversion at inference time. All operations remain in the binary spike domain, thereby preserving the reported 4.16–5.87 % computational cost. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical measurements of novel components

full rationale

The paper introduces SFSA to eliminate softmax and FP MatMul in attention, and SpAD as a distillation procedure that aligns embeddings, attention maps, features, and logits. Performance numbers (4.16%-5.87% cost, competitive accuracy with 5.6% tokens) are presented as measured experimental outcomes on NLG tasks, not as quantities derived by construction from fitted parameters or self-referential definitions. No equation reduces the target result to its own inputs; the derivation chain consists of architectural proposals whose validity is checked externally via benchmarks rather than presupposed.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions of spiking neuron models and knowledge distillation; no new physical entities are postulated. Free parameters are the usual training hyperparameters and scaling factors implicit in any large-model experiment.

axioms (2)

domain assumption Spiking neurons can be trained to approximate continuous ANN activations via rate or temporal coding without loss of representational capacity when distillation is applied.
Invoked when claiming that SpAD alignment preserves performance.
domain assumption All matrix multiplications and nonlinearities can be replaced by binary spike operations while preserving autoregressive language modeling semantics.
Core premise of the SFSA design.

pith-pipeline@v0.9.0 · 5568 in / 1444 out tokens · 51024 ms · 2026-05-15T07:06:47.550153+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean period8 / flipAt512 echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

SFSA replaces softmax with spike-based dot products, causal masking, and spiking neuron activation (Eq. 1, Fig. 1, Alg. 1)
IndisputableMonolith/Foundation/ArrowOfTime.lean zAtStep / forward_accumulates unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Rate-MSE loss aligns time-averaged spike rates over T steps (Eq. 9, 13)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 9 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901,

work page 1901
[3]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions.arXiv preprint arXiv:1905.10044,

work page internal anchor Pith review Pith/arXiv arXiv 1905
[4]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Advancing residual learning towards powerful deep spiking neural networks

Hu, Y ., Wu, Y ., Deng, L., and Li, G. Advancing residual learning towards powerful deep spiking neural networks. arXiv preprint arXiv:2112.08954, 7:7,

work page arXiv
[6]

Tinybert: Distilling bert for natural language understanding.arXiv preprint arXiv:1909.10351,

Jiao, X., Yin, Y ., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., and Liu, Q. Tinybert: Distilling bert for natural language understanding.arXiv preprint arXiv:1909.10351,

work page arXiv 1909
[7]

K., Pandey, T., Bha- gat, A., and Rish, I

Kaushal, A., Vaidhya, T., Mondal, A. K., Pandey, T., Bha- gat, A., and Rish, I. Spectra: Surprising effectiveness of pretraining ternary language models at scale.arXiv preprint arXiv:2407.12327,

work page arXiv
[8]

Kundu, S., Datta, G., Pedram, M., and Beerel, P. A. Spike- thrift: Towards energy-efficient deep spiking neural net- works by limiting spiking activity via attention-guided compression. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 3953– 3962, 2021a. 9 Submission and Formatting Instructions for ICML 2026 Kundu, S....

work page 2026
[9]

Spikebert: A language spikformer learned from bert with knowledge distillation.arXiv preprint arXiv:2308.15122,

Lv, C., Li, T., Xu, J., Gu, C., Ling, Z., Zhang, C., Zheng, X., and Huang, X. Spikebert: A language spikformer learned from bert with knowledge distillation.arXiv preprint arXiv:2308.15122,

work page arXiv
[10]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering.arXiv preprint arXiv:1809.02789,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Sanh, V ., Debut, L., Chaumond, J., and Wolf, T. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108,

work page internal anchor Pith review Pith/arXiv arXiv 1910
[12]

Omniquant: Omnidirectionally calibrated quantization for large lan- guage models.arXiv preprint arXiv:2308.13137,

Shao, W., Chen, M., Zhang, Z., Xu, P., Zhao, L., Li, Z., Zhang, K., Gao, P., Qiao, Y ., and Luo, P. Omniquant: Omnidirectionally calibrated quantization for large lan- guage models.arXiv preprint arXiv:2308.13137,

work page arXiv
[13]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi`ere, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation lan- guage models.arXiv preprint arXiv:2302.13971,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

10 Submission and Formatting Instructions for ICML 2026 Tropp, J. A. et al. An introduction to matrix concentra- tion inequalities.Foundations and Trends® in Machine Learning, 8(1-2):1–230,

work page 2026
[15]

HEAD-QA: A Healthcare Dataset for Complex Reasoning

Vilares, D. and G´omez-Rodr´ıguez, C. Head-qa: A health- care dataset for complex reasoning.arXiv preprint arXiv:1906.04701,

work page internal anchor Pith review Pith/arXiv arXiv 1906
[16]

Bitnet: Scaling 1- bit transformers for large language models.arXiv preprint arXiv:2310.11453,

Wang, H., Ma, S., Dong, L., Huang, S., Wang, H., Ma, L., Yang, F., Wang, R., Wu, Y ., and Wei, F. Bitnet: Scaling 1- bit transformers for large language models.arXiv preprint arXiv:2310.11453,

work page arXiv
[17]

A., Xiao, S., Zhang, W., Du, L., Zhang, Z., Li, G., and Zhang, J

Xing, X., Gao, B., Liu, Z., Clifton, D. A., Xiao, S., Zhang, W., Du, L., Zhang, Z., Li, G., and Zhang, J. Spikellm: Scaling up spiking neural network to large language mod- els via saliency-based spiking. InThe Thirteenth Interna- tional Conference on Learning Representations, 2024a. Xing, X., Zhang, Z., Ni, Z., Xiao, S., Ju, Y ., Fan, S., Wang, Y ., Zhan...

work page arXiv
[18]

HellaSwag: Can a Machine Really Finish Your Sentence?

Zellers, R., Holtzman, A., Bisk, Y ., Farhadi, A., and Choi, Y . Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,

work page internal anchor Pith review Pith/arXiv arXiv 1905
[19]

OPT: Open Pre-trained Transformer Language Models

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V ., et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Spikingformer: Spike-driven residual learn- ing for transformer-based spiking neural network.arXiv preprint arXiv:2304.11954,

Zhou, C., Yu, L., Zhou, Z., Ma, Z., Zhang, H., Zhou, H., and Tian, Y . Spikingformer: Spike-driven residual learn- ing for transformer-based spiking neural network.arXiv preprint arXiv:2304.11954,

work page arXiv
[21]

Spikformer: When spiking neural network meets transformer.arXiv preprint arXiv:2209.15425,

Zhou, Z., Zhu, Y ., He, C., Wang, Y ., Yan, S., Tian, Y ., and Yuan, L. Spikformer: When spiking neural network meets transformer.arXiv preprint arXiv:2209.15425,

work page arXiv
[22]

Zhu, R.-J., Zhao, Q., Li, G., and Eshraghian, J. K. Spikegpt: Generative pre-trained language model with spiking neu- ral networks.arXiv preprint arXiv:2302.13939,

work page arXiv
[23]

Use of LLMs In this work, we used Large Language Models (LLMs) in a limited and auxiliary capacity

11 Submission and Formatting Instructions for ICML 2026 A. Use of LLMs In this work, we used Large Language Models (LLMs) in a limited and auxiliary capacity. Specifically, LLMs were employed for retrieval and discovery of related literature on Spiking Neural Networks (SNNs), neuromorphic computing, and energy-efficient large language models. This assiste...

work page 2026
[24]

We setP5 i=1 λi = 1 to keep the overall loss scale stable across experiments. The weights are selected based on preliminary sweeps: placing more emphasis on token-level attention/output alignment improves convergence and accuracy, whereas assigning too much weight to auxiliary terms can hurt training stability. Inference Time StepsTo investigate the trade...

work page 2024
[25]

Assume the pre-synaptic input is uniformly bounded, i.e., ∥Xt∥ ≤Mfor allt

Let ∥ · ∥ denote any norm that satisfies (i) the triangle inequality and (ii) positive homogeneity (e.g., ℓ2 norm for vectors or Frobenius norm for matrices). Assume the pre-synaptic input is uniformly bounded, i.e., ∥Xt∥ ≤Mfor allt. 18 Submission and Formatting Instructions for ICML 2026 Lemma E.1(Uniform bound one t).Under the above assumptions andβ∈(0,...

work page 2026
[26]

In each figure, rows correspond to the selected layers, while columns represent discrete inference time steps

under different inference time steps. In each figure, rows correspond to the selected layers, while columns represent discrete inference time steps. Within each subplot, the horizontal axis indicates token positions and the vertical axis indexes neurons (or channels) within the corresponding layer; color intensity encodes the firing magnitude, 25 Submissi...

work page 2026
[27]

(2023), Wang et al

While quantized ANNs like Shao et al. (2023), Wang et al. (2023), and Kaushal et al. (2024) may exhibit a marginal edge in accuracy, we emphasize that these represent fundamentally different methodological paradigms. Therefore, the primary contribution of our work is not to surpass quantization methods in accuracy, but to pioneer and validate a new, energ...

work page 2023
[28]

Figure 11.Training loss curves for the BiSpikCLM-1.3B model, comparing the 10B and 25B token training regimes

This stable convergence behavior across different training scales demonstrates the robustness of our proposed BiSpikCLM architecture and the effectiveness of the SFSA mechanism in facilitating the optimization of spike-based language models. Figure 11.Training loss curves for the BiSpikCLM-1.3B model, comparing the 10B and 25B token training regimes. The ...

work page 2026
[29]

The firing rate (sparsity) is shown in parentheses

Table 11.Perplexity (PPL) on WikiText-2 with varying context lengths and firing thresholds. The firing rate (sparsity) is shown in parentheses. Model Params (B) Firing Threshold Context Length 512 1024 2048 4096 8192 OPT 1.300 - 16.26 13.58 11.13 - - Llama-3.2 1.200 - 12.93 10.96 9.76 9.02 8.54 BiSpikCLM 1.300 0.70 36.72 (0.184) 32.49 (0.183) 29.34 (0.181...

work page 2048