SpikeTAD: Spiking Neural Networks for End-to-End Temporal Action Detection
Pith reviewed 2026-06-27 09:47 UTC · model grok-4.3
The pith
SpikeTAD is the first end-to-end spiking neural network for temporal action detection that reaches competitive accuracy at extremely low power.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We explore the application of SNNs on temporal action detection and propose the first SNN-based end-to-end TAD architecture coined as SpikeTAD. While maintaining extremely low power consumption, SpikeTAD achieves an average mAP of 67.2% in THUMOS14 and 37.42% in ActivityNet-1.3, demonstrating the feasibility of a low-power TAD model.
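As a reading aid for the "average mAP" figures, here is a minimal sketch of the two pieces of arithmetic behind such a number: the temporal IoU between a predicted and a ground-truth segment, and the mean of per-threshold mAP values. The function names are hypothetical, the threshold grid shown in the comments (0.3 to 0.7 on THUMOS14) is only the usual benchmark convention and is assumed rather than confirmed by the text above, and the per-threshold values in the usage note are placeholders, not results from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def average_map(map_per_threshold):
    """Mean of mAP values keyed by tIoU threshold.

    Assumption: the reported 'average mAP' is the plain mean over a fixed
    threshold grid (conventionally 0.3:0.1:0.7 on THUMOS14).
    """
    return sum(map_per_threshold.values()) / len(map_per_threshold)
```

For example, `temporal_iou((0.0, 10.0), (5.0, 15.0))` gives one third, so that prediction would count as a match at a 0.3 threshold but not at 0.5. Calling `average_map({0.3: 0.8, 0.5: 0.6, 0.7: 0.4})` on these placeholder values gives their mean.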
What carries the argument
SpikeTAD, the end-to-end spiking architecture that adapts SNN conversion techniques to the temporal action detection pipeline to avoid severe performance loss and long time-steps.
If this is right
- SNNs become practical for full video understanding pipelines rather than classification alone.
- Mobile devices can host action detection models that run on neuromorphic chips with minimal energy draw.
- End-to-end TAD no longer requires high-power artificial neural network backbones.
- Low-power video models open deployment on future mobile hardware that relies on spiking computation.
Where Pith is reading between the lines
- The same adaptation steps could be tested on related tasks such as temporal action segmentation or video captioning to check whether the performance retention generalizes.
- Direct measurement of energy use on actual neuromorphic hardware would give a concrete number for the claimed power advantage over ANN baselines.
- Hybrid networks that mix spiking layers with a few conventional layers might further improve the accuracy-power trade-off beyond what pure SpikeTAD achieves.
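Until hardware measurements exist, power claims of this kind are often backed by an operation-count estimate. The sketch below shows that style of estimate under stated assumptions: the per-operation energies (4.6 pJ per 32-bit multiply-accumulate, 0.9 pJ per 32-bit accumulate) are the 45 nm figures commonly quoted in the SNN literature, and the function names and example workload are hypothetical. Nothing here is confirmed to be the accounting used for SpikeTAD.

```python
E_MAC_PJ = 4.6  # assumed energy per 32-bit multiply-accumulate, in picojoules
E_AC_PJ = 0.9   # assumed energy per 32-bit accumulate, in picojoules


def ann_energy_pj(macs):
    """Estimated ANN energy: every connection performs one MAC per inference."""
    return macs * E_MAC_PJ


def snn_energy_pj(macs, firing_rate, timesteps):
    """Estimated SNN energy: only spiking inputs trigger an accumulate.

    Synaptic operations = connections * average firing rate * time-steps.
    """
    synaptic_ops = macs * firing_rate * timesteps
    return synaptic_ops * E_AC_PJ


def energy_ratio(macs, firing_rate, timesteps):
    """SNN energy as a fraction of the ANN energy for the same topology."""
    return snn_energy_pj(macs, firing_rate, timesteps) / ann_energy_pj(macs)
```

With a hypothetical workload of 1e9 MACs, a 10% average firing rate, and 4 time-steps, `energy_ratio(1e9, 0.1, 4)` is roughly 0.08. The estimate grows linearly with the number of time-steps, which is why long conversion time-steps erode the power advantage.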
Load-bearing premise
Standard SNN conversion techniques or architectural adaptations can be applied to temporal action detection without large accuracy drops or excessively long conversion time-steps.
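To make the time-step concern concrete, the following sketch simulates a single integrate-and-fire neuron with reset-by-subtraction under a constant input and compares its scaled firing rate with the clipped ReLU it is meant to approximate. This is a generic textbook illustration of rate-coded ANN-to-SNN conversion, not the conversion method used in SpikeTAD; all names are hypothetical.

```python
def if_neuron_rate(x, timesteps, threshold=1.0):
    """Scaled firing rate of an integrate-and-fire neuron driven by constant input x."""
    membrane = 0.0
    spikes = 0
    for _ in range(timesteps):
        membrane += x
        if membrane >= threshold:
            membrane -= threshold  # soft reset: subtract the threshold, keep the residue
            spikes += 1
    return threshold * spikes / timesteps


def clipped_relu(x, threshold=1.0):
    """The ANN activation the spiking neuron approximates."""
    return min(max(x, 0.0), threshold)


def max_conversion_error(timesteps, grid=101):
    """Worst-case gap between firing rate and clipped ReLU over inputs in [0, 1]."""
    inputs = [i / (grid - 1) for i in range(grid)]
    return max(abs(if_neuron_rate(x, timesteps) - clipped_relu(x)) for x in inputs)
```

In this toy setting the worst-case error is bounded by roughly `1 / timesteps`, so halving the error requires doubling the simulation length. Errors of this kind compound across layers, which is one reason conversion methods that work for classification may need many time-steps on a dense prediction task such as TAD.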
What would settle it
Run the converted SpikeTAD model on THUMOS14; accuracy well below 67 percent mAP combined with conversion time-steps in the thousands would falsify the feasibility claim.
Original abstract
Video understanding is a crucial part of computer vision, with numerous application scenarios. With the increasing popularity of mobile devices, an increasing number of efforts are trying to deploy video understanding models on them. However, existing video understanding models are difficult to deploy due to their large size and prohibitive power consumption. Spiking Neural Networks (SNNs) have shown bioplausibility and low power advantages over Artificial Neural Networks (ANNs), especially on neuromorphic chips which are regarded as essential components of future mobile devices. However, excessively long conversion time-steps and severe performance degradation problems limit their application. To solve the problems above, we explore the application of SNNs on temporal action detection (TAD), which is an important task in video understanding, and propose the first SNN-based end-to-end TAD architecture coined as SpikeTAD. While maintaining extremely low power consumption, SpikeTAD achieves an average mAP of 67.2% in THUMOS14 and 37.42% in ActivityNet-1.3, demonstrating the feasibility of a low-power TAD model. Our code is available at https://github.com/MCG-NJU/SpikeTAD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SpikeTAD, presented as the first end-to-end spiking neural network architecture for temporal action detection (TAD). It claims to overcome the barriers of excessively long conversion time-steps and severe performance degradation that have limited SNN application to video understanding, while reporting an average mAP of 67.2% on THUMOS14 and 37.42% on ActivityNet-1.3 together with extremely low power consumption. The abstract states that code is released at a public repository.
Significance. If the reported mAP values are shown to arise from a properly validated end-to-end SNN pipeline that genuinely avoids the usual conversion-time and accuracy penalties, the result would establish the practical feasibility of low-power SNNs for a core video-understanding task and could support deployment on neuromorphic hardware. The work would therefore be of interest to both the neuromorphic-computing and video-analysis communities.
major comments (1)
- [Abstract] The feasibility claim rests on the assertion that SpikeTAD solves the long conversion time-step and performance-degradation problems, yet the abstract does not give the number of time-steps employed, the conversion method used, or any direct ANN baseline comparison; without these quantities the central claim cannot be evaluated from the given text.
Simulated Author's Rebuttal
We thank the referee for the detailed comment on the abstract. We address it point-by-point below and will revise the manuscript accordingly.
- Referee: [Abstract] The feasibility claim rests on the assertion that SpikeTAD solves the long conversion time-step and performance-degradation problems, yet the abstract does not give the number of time-steps employed, the conversion method used, or any direct ANN baseline comparison; without these quantities the central claim cannot be evaluated from the given text.
Authors: We agree that the abstract, as currently written, does not include the specific quantities needed to evaluate the central claim directly from the abstract alone. The manuscript body provides the time-step count, conversion details, and ANN comparisons in the experimental sections, but we acknowledge that these should be summarized in the abstract for clarity. In the revised version we will update the abstract to state the number of time-steps, the conversion method employed, and the direct ANN baseline mAP values.
Revision: yes
Circularity Check
No significant circularity detected
The paper is an empirical proposal of an SNN architecture for TAD, reporting benchmark mAP numbers. No equations, derivations, predictions from fitted parameters, or load-bearing self-citations appear in the abstract or described content. No steps reduce by construction to inputs, so the derivation chain (if any) is self-contained against external benchmarks.