pith. machine review for the scientific record.

arxiv: 2604.25570 · v1 · submitted 2026-04-28 · 💻 cs.CV

Recognition: unknown

Vision SmolMamba: Spike-Guided Token Pruning for Energy-Efficient Spiking State-Space Vision Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords spiking neural networks · token pruning · state space models · energy-efficient vision · event-based vision · neuromorphic computing · spiking transformers

The pith

Vision SmolMamba uses spike strength and first-spike latency to prune tokens in a spiking state-space model, cutting estimated energy cost by at least 1.5x versus prior spiking transformers while matching or improving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a spiking vision architecture that replaces quadratic self-attention with linear selective state-space recurrence and adds a pruning step guided by spike signals. The pruning step measures each token's importance through its spike activation strength and the timing of its first spike, then progressively discards the least important tokens across layers. This combination is tested on both ordinary image datasets and event-camera recordings, where it delivers the same or higher classification accuracy at substantially lower estimated energy. A sympathetic reader would care because spiking networks are already sparse and event-driven; removing the quadratic bottleneck while keeping the sparsity should make them practical for larger images or longer video sequences without custom hardware. The results are presented as evidence that spike-guided sparsity and state-space modeling together form a scalable route for energy-efficient visual computation.
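To make the quadratic-versus-linear point concrete, a back-of-envelope count of per-layer token interactions is sketched below. It assumes ViT-style 16x16 patches purely for illustration; the abstract does not state the SPS patch size or token counts, so the numbers are indicative rather than the paper's.

```python
# Rough, illustrative comparison of per-layer token interactions for quadratic
# self-attention versus a linear bidirectional state-space scan. The 16x16
# patch size is an assumption, not taken from the paper.

def token_count(height: int, width: int, patch: int = 16) -> int:
    """Number of spatial tokens after patchifying an image."""
    return (height // patch) * (width // patch)

for side in (224, 384, 512):
    n = token_count(side, side)
    attention_pairs = n * n   # every token interacts with every other token
    ssm_updates = 2 * n       # one forward and one backward scan over the sequence
    print(f"{side}x{side}: N={n:4d}  attention ~{attention_pairs:7d}  bidirectional scan ~{ssm_updates:5d}")
```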

Core claim

The central claim is that a Spike-Guided Spatio-Temporal Token Pruner (SST-TP) can be fused with bidirectional spiking state-space recurrence inside SmolMamba blocks to produce a vision backbone that performs long-range modeling in linear time, progressively removes redundant tokens on the basis of spike activity, and thereby achieves at least 1.5 times lower estimated energy cost than both spiking Transformer baselines and an earlier Spiking Mamba variant on ImageNet-1K, CIFAR, CIFAR10-DVS, and DVS128 Gesture while preserving competitive or better accuracy.
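The abstract does not give the SmolMamba block's equations, so the following is only a minimal sketch of what "spike events fed directly into bidirectional state-space recurrence" could look like: binary tokens drive a diagonal linear scan run forward and backward, and the summed readout is re-thresholded into spikes. The diagonal parameterization, the hard threshold, and the toy dimensions are assumptions, not the authors' design.

```python
import numpy as np

def heaviside_spike(x, threshold=1.0):
    """Binary spike generation; actual spiking training would use a surrogate gradient."""
    return (x >= threshold).astype(np.float32)

def bidirectional_spiking_ssm(spikes, A, B, C):
    """
    spikes: (N, D) binary spike tokens; A, B, C: diagonal state-space parameters of shape (D,).
    Cost is O(N * D): one forward and one backward scan, with no N x N interaction matrix.
    """
    N, D = spikes.shape
    out = np.zeros((N, D), dtype=np.float32)
    for direction in (1, -1):
        h = np.zeros(D, dtype=np.float32)                       # hidden state for this scan direction
        order = range(N) if direction == 1 else range(N - 1, -1, -1)
        for t in order:
            h = A * h + B * spikes[t]                           # linear state update driven only by spike events
            out[t] += C * h                                     # accumulate readout from both directions
    return heaviside_spike(out)                                 # re-spike so downstream blocks stay event-driven

# toy usage: 16 tokens, 8 channels
rng = np.random.default_rng(0)
tokens = (rng.random((16, 8)) < 0.2).astype(np.float32)
y = bidirectional_spiking_ssm(tokens, A=np.full(8, 0.9), B=np.ones(8), C=np.ones(8))
```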

What carries the argument

The Spike-Guided Spatio-Temporal Token Pruner (SST-TP), which scores token importance from spike activation strength and first-spike latency and removes the lowest-scoring tokens layer by layer before feeding the survivors into spiking bidirectional state-space recurrence.
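The abstract names the two signals but not the scoring rule, schedule, or normalization, so the sketch below is only one plausible form: Z-score each token's total spike count and (negated) first-spike time, combine them, and keep the top fraction. The Z-score normalization is suggested by Figures 5-6; the weighting alpha and the keep ratio are invented for illustration.

```python
import numpy as np

def zscore(x, eps=1e-6):
    return (x - x.mean()) / (x.std() + eps)

def sst_tp_prune(spike_trains, keep_ratio=0.7, alpha=0.5):
    """
    spike_trains: (T, N, D) binary spikes over T timesteps for N tokens of D channels.
    Returns sorted indices of kept tokens. Higher spike strength and earlier first
    spikes score higher; the paper's actual formula is not given in the abstract.
    """
    T, N, D = spike_trains.shape
    strength = spike_trains.sum(axis=(0, 2))                               # total spikes per token
    fired = spike_trains.any(axis=2)                                       # (T, N): did token fire at step t
    first_spike = np.where(fired.any(axis=0), fired.argmax(axis=0), T)     # latency; T if the token never fired
    score = alpha * zscore(strength) - (1 - alpha) * zscore(first_spike.astype(np.float32))
    n_keep = max(1, int(round(keep_ratio * N)))
    return np.sort(np.argsort(score)[::-1][:n_keep])                       # highest-scoring tokens survive
```

Applied layer by layer with a shrinking keep ratio, this is the kind of progressive token removal described above; the surviving tokens would then feed the bidirectional recurrence sketched earlier.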

If this is right

  • The architecture scales to higher-resolution inputs or longer temporal sequences because token count is reduced while computation per token stays linear.
  • Spiking state-space models can now exploit the same token-sparsity benefits previously limited to attention-based spiking transformers.
  • Energy estimates on both static-image and event-based benchmarks improve by at least 1.5 times relative to prior spiking attention and Spiking Mamba baselines at matched accuracy.
  • The same spike-guided pruning rule can be applied inside other recurrent or state-space spiking blocks without changing the underlying spike-driven dynamics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pruning rule generalizes across datasets, it could be used to adaptively allocate compute in real-time neuromorphic vision pipelines where input sparsity varies frame to frame.
  • The linear-time recurrence plus early token removal may allow spiking models to run on low-power edge devices that currently cannot sustain full-resolution transformer attention.
  • Combining the reported energy numbers with known neuromorphic hardware characteristics would let one estimate end-to-end latency and power for specific chips.

Load-bearing premise

Spike activation strength and first-spike latency are sufficient to identify which tokens can be safely discarded without removing information needed for correct final classification on both static images and event streams.

What would settle it

An experiment on ImageNet-1K or CIFAR10-DVS in which the pruned SmolMamba model loses more accuracy than its unpruned counterpart, or in which the measured energy reduction relative to the strongest spiking Transformer baseline falls below 1.5x.

Figures

Figures reproduced from arXiv: 2604.25570 by Dewei Bai, Hong Qu, Hongxiang Peng, Yi Zhang, Yunyun Zeng, Ziyu Zhang.

Figure 1. The estimated energy–accuracy landscape of ANN and …
Figure 2. Overview of Vision SmolMamba. (a) The overall architecture: spike-form visual patches generated with SPS are …
Figure 3. Illustration of the proposed Spike-Guided Spatio…
Figure 4. Visualization of token pruning results in ImageNet-1K.
Figure 5. Train loss and kept tokens of Vision SmolMamba-2-256 on CIFAR-10. (a) Without Z-score normalization, pruning stays …
Figure 6. Train loss and kept tokens of Vision SmolMamba-8-512 on ImageNet-1K. (a) Without Z-score normalization, pruning …
original abstract

Spiking Transformers have shown strong potential for long-range visual modeling through spike-driven self-attention. However, their quadratic token interactions remain fundamentally misaligned with the sparse and event-driven nature of spiking neural computation. To address this limitation, we propose Vision SmolMamba, an energy-efficient spiking state-space architecture that integrates spike-driven dynamics with linear-time selective recurrence. The key idea is a Spike-Guided Spatio-Temporal Token Pruner (SST-TP), which estimates token importance using both spike activation strength and first-spike latency. This mechanism progressively removes redundant tokens while preserving salient spatio-temporal information, enabling efficient scaling with token sparsity. Based on this mechanism, the proposed SmolMamba block incorporates spike events directly into bidirectional state-space recurrence, forming a spiking state-space vision backbone for efficient long-range modeling. Extensive experiments on both static and event-based benchmarks, including ImageNet-1K, CIFAR10/100, CIFAR10-DVS, and DVS128 Gesture, demonstrate that Vision SmolMamba consistently achieves superior accuracy-efficiency trade-offs. In particular, it reduces the estimated energy cost by at least 1.5x compared with prior spiking Transformer baselines and a Spiking Mamba variant while maintaining competitive or improved accuracy. These results demonstrate that combining spike-guided token sparsity with state-space modeling offers a scalable and energy-efficient paradigm for spiking vision systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Vision SmolMamba, a spiking state-space vision architecture that integrates spike-driven dynamics with linear-time selective recurrence via a novel Spike-Guided Spatio-Temporal Token Pruner (SST-TP). SST-TP estimates token importance from spike activation strength and first-spike latency to progressively prune redundant tokens while preserving salient spatio-temporal features. The SmolMamba block embeds spike events into bidirectional state-space recurrence. Experiments on ImageNet-1K, CIFAR10/100, CIFAR10-DVS, and DVS128 Gesture show superior accuracy-efficiency trade-offs, including at least 1.5x lower estimated energy cost versus spiking Transformer baselines and a Spiking Mamba variant, with competitive or improved accuracy.

Significance. If the central claims hold, the work offers a scalable route to energy-efficient spiking vision models by replacing quadratic token interactions with state-space recurrence augmented by spike-guided sparsity. This addresses a key misalignment between spiking computation and Transformer-style attention, with empirical support across both static and event-based vision benchmarks. The approach could inform low-power neuromorphic deployments, though its impact depends on the robustness of the pruning mechanism and energy estimates.

major comments (2)
  1. [Abstract and Experiments] The reported energy reductions are labeled 'estimated' with no description of the estimation method, hardware model, power model parameters, or inclusion of error bars/statistical tests. This is load-bearing for the headline 1.5x efficiency claim and the accuracy-efficiency trade-off.
  2. [SST-TP description] Token pruning relies on spike activation strength and first-spike latency as proxies for importance, yet no independent verification (e.g., information-preservation metrics, reconstruction error, or targeted ablations on critical features) is provided to confirm that task-relevant spatio-temporal content is retained. This assumption is load-bearing because accuracy maintenance on event-based data (CIFAR10-DVS, DVS128 Gesture) depends on it; mis-pruning could explain the reported trade-off without true efficiency gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of clarity and validation that we will address through revisions. Below we respond point-by-point to the major comments.

point-by-point responses
  1. Referee: [Abstract and Experiments] The reported energy reductions are labeled 'estimated' with no description of the estimation method, hardware model, power model parameters, or inclusion of error bars/statistical tests. This is load-bearing for the headline 1.5x efficiency claim and the accuracy-efficiency trade-off.

    Authors: We agree that explicit documentation of the energy estimation procedure is necessary to substantiate the efficiency claims. In the revised manuscript we will add a dedicated subsection in the Experiments section that fully describes the energy model. This will include: (i) the operation-counting methodology (spike events and state updates in the SmolMamba blocks), (ii) the assumed hardware platform and per-operation energy costs drawn from established neuromorphic literature, (iii) the precise power-model parameters, and (iv) error bars computed as standard deviations over multiple independent training runs. Where direct comparisons are presented we will also include statistical significance tests (paired t-tests) to quantify the reliability of the reported 1.5x gains. revision: yes
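For orientation, the energy-model convention most spiking-vision papers follow (and which the promised subsection would presumably resemble) counts synaptic operations and multiplies by per-operation energies from 45 nm CMOS measurements, roughly 4.6 pJ per MAC and 0.9 pJ per accumulate. Whether Vision SmolMamba uses exactly these constants or this op-counting rule is not stated in the abstract, so the sketch below is an assumption.

```python
# Common SNN energy-estimation convention (not confirmed as this paper's exact model):
# spike-driven layers cost one accumulate (AC) per synaptic operation, dense ANN
# layers cost one multiply-accumulate (MAC). Per-op energies are the widely cited
# 45 nm figures.
E_MAC_PJ = 4.6   # picojoules per multiply-accumulate
E_AC_PJ = 0.9    # picojoules per accumulate

def spiking_layer_energy_pj(flops_equivalent: float, firing_rate: float, timesteps: int) -> float:
    """Estimated energy: synaptic operations = firing_rate * timesteps * FLOPs of the dense equivalent."""
    sops = firing_rate * timesteps * flops_equivalent
    return sops * E_AC_PJ

def dense_layer_energy_pj(flops: float) -> float:
    return flops * E_MAC_PJ

# e.g. a layer with 1e8 dense FLOPs, 15% firing rate, 4 timesteps
print(spiking_layer_energy_pj(1e8, 0.15, 4) / 1e6, "microjoules (estimated)")
```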

  2. Referee: [SST-TP description] Token pruning relies on spike activation strength and first-spike latency as proxies for importance, yet no independent verification (e.g., information-preservation metrics, reconstruction error, or targeted ablations on critical features) is provided to confirm that task-relevant spatio-temporal content is retained. This assumption is load-bearing because accuracy maintenance on event-based data (CIFAR10-DVS, DVS128 Gesture) depends on it; mis-pruning could explain the reported trade-off without true efficiency gains.

    Authors: We acknowledge that additional direct validation of the SST-TP pruning criterion would strengthen the paper. Although the maintained accuracy on event-based benchmarks provides supporting evidence that salient features are retained, we will incorporate new targeted experiments in the revision. These will comprise: (i) ablations contrasting SST-TP against random pruning and against single-proxy variants (activation strength only or latency only), (ii) a quantitative information-preservation metric that measures the fraction of high-activation spikes retained after pruning, and (iii) reconstruction-error analysis on a held-out subset of CIFAR10-DVS and DVS128 Gesture using a lightweight decoder trained to reconstruct the original spike sequences from the pruned token set. The results will be reported alongside the existing accuracy-efficiency curves. revision: yes
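A minimal sketch of the kind of check the rebuttal promises, reusing the sst_tp_prune sketch from earlier on this page: prune with SST-TP and with a random baseline at the same keep ratio, then measure how many spikes from the most active tokens survive. The retention metric and the toy data are illustrative assumptions, not the authors' protocol.

```python
import numpy as np

def spike_retention(spike_trains, kept_idx, quantile=0.9):
    """Fraction of spikes belonging to the most active tokens that survive pruning."""
    per_token = spike_trains.sum(axis=(0, 2))                   # spike count per token
    hot = per_token >= np.quantile(per_token, quantile)         # "high-activation" tokens
    kept_mask = np.zeros(per_token.shape[0], dtype=bool)
    kept_mask[kept_idx] = True
    return float(per_token[hot & kept_mask].sum() / max(per_token[hot].sum(), 1.0))

rng = np.random.default_rng(0)
toy_spikes = (rng.random((4, 196, 64)) < 0.1).astype(np.float32)   # toy (T, N, D) spike tensor
kept_sst = sst_tp_prune(toy_spikes, keep_ratio=0.7)                # pruner sketched earlier
kept_rand = np.sort(rng.choice(196, size=len(kept_sst), replace=False))
print("SST-TP retention:", spike_retention(toy_spikes, kept_sst))
print("random retention:", spike_retention(toy_spikes, kept_rand))
```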

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical benchmarks

full rationale

The paper presents an architectural proposal (SST-TP token pruning guided by spike strength and latency, integrated into a spiking state-space block) whose performance claims are supported by direct experiments on external benchmarks (ImageNet-1K, CIFAR10/100, CIFAR10-DVS, DVS128 Gesture). No equations, fitted parameters, or self-citations make the reported accuracy-efficiency trade-off true by construction. The 1.5x energy reduction is an observed experimental outcome rather than a definitional or fitted prediction. The evidential chain therefore rests on external benchmarks rather than on constructions internal to the paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The architecture introduces two new components whose internal parameters and neuron models are not enumerated here.

free parameters (2)
  • pruning threshold schedule
    Controls how aggressively tokens are removed at each stage; must be chosen or tuned (one plausible form is sketched just after this ledger).
  • state-space recurrence parameters
    Selective state update coefficients in the SmolMamba block.
axioms (1)
  • domain assumption · Spiking neurons follow standard integrate-and-fire or similar dynamics
    Implicit in all spiking vision work; not re-derived.
invented entities (2)
  • Spike-Guided Spatio-Temporal Token Pruner (SST-TP) · no independent evidence
    purpose: Estimates token importance from spike strength and latency to enable progressive pruning
    Core novel mechanism introduced to reconcile quadratic attention with sparse spikes.
  • SmolMamba block · no independent evidence
    purpose: Bidirectional state-space recurrence that ingests spike events directly
    New architectural unit that replaces self-attention in the spiking backbone.
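Because only the abstract is available, the pruning-threshold schedule listed above is unknown; the sketch below shows one plausible form, a keep ratio that decays linearly across stages. The endpoints and the linear shape are assumptions.

```python
def keep_ratio_schedule(num_stages: int, start: float = 1.0, end: float = 0.5):
    """One plausible pruning-threshold schedule: the fraction of tokens kept decays
    linearly across stages. The paper's actual schedule is not described in the abstract."""
    if num_stages == 1:
        return [end]
    step = (start - end) / (num_stages - 1)
    return [start - i * step for i in range(num_stages)]

# e.g. 4 stages -> [1.0, 0.833, 0.667, 0.5]: prune nothing at first, keep half the tokens by the last stage
print(keep_ratio_schedule(4))
```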

pith-pipeline@v0.9.0 · 5565 in / 1409 out tokens · 40572 ms · 2026-05-07T16:37:01.747110+00:00 · methodology

discussion (0)

