SpikeTAD: Spiking Neural Networks for End-to-End Temporal Action Detection

Limin Wang; Min Yang; Mi Zhou

arxiv: 2606.12033 · v1 · pith:ZOE7L2P2new · submitted 2026-06-10 · 💻 cs.CV

Pith reviewed 2026-06-27 09:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords: spiking neural networks, temporal action detection, low power consumption, video understanding, end-to-end model, THUMOS14 benchmark, ActivityNet benchmark

The pith

SpikeTAD is the first end-to-end spiking neural network for temporal action detection that reaches competitive accuracy at extremely low power.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpikeTAD to apply spiking neural networks to temporal action detection, a key video understanding task. It targets the barriers of long conversion time-steps and accuracy loss that have blocked SNN use in video models. The resulting architecture performs end-to-end detection while using far less power than standard networks. On standard benchmarks it records 67.2 percent average mAP on THUMOS14 and 37.42 percent on ActivityNet-1.3. This outcome shows that low-power SNN models are viable for video tasks on mobile and neuromorphic hardware.

Core claim

We explore the application of SNNs on temporal action detection and propose the first SNN-based end-to-end TAD architecture coined as SpikeTAD. While maintaining extremely low power consumption, SpikeTAD achieves an average mAP of 67.2% in THUMOS14 and 37.42% in ActivityNet-1.3, demonstrating the feasibility of a low-power TAD model.
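For readers unfamiliar with the metric, the average mAP quoted above is a mean of mAP values taken at several temporal IoU (tIoU) thresholds. The sketch below is a simplified, single-class illustration. It assumes the commonly used THUMOS14 thresholds 0.3 to 0.7 in steps of 0.1, which this page does not confirm, and it uses a non-interpolated AP with greedy matching rather than the official evaluation code. All function names and the toy segments are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

# Assumed THUMOS14 convention; not stated on this page.
THUMOS14_TIOU = [0.3, 0.4, 0.5, 0.6, 0.7]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1-D time segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds: List[Tuple[Segment, float]],
                      gts: List[Segment], thr: float) -> float:
    """Non-interpolated AP for one class at one tIoU threshold."""
    matched = [False] * len(gts)
    hits = []
    # Visit predictions from most to least confident.
    for seg, _score in sorted(preds, key=lambda p: -p[1]):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            iou = temporal_iou(seg, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        ok = best_j >= 0 and best_iou >= thr and not matched[best_j]
        if ok:
            matched[best_j] = True  # each ground truth can be claimed once
        hits.append(ok)
    tp, ap = 0, 0.0
    for rank, ok in enumerate(hits, start=1):
        if ok:
            tp += 1
            ap += tp / rank  # precision at this true positive
    return ap / len(gts) if gts else 0.0


def average_map(preds, gts, thresholds) -> float:
    """Mean of the per-threshold AP values."""
    return sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)


# Toy example: two ground-truth actions, three scored predictions.
gts = [(10.0, 20.0), (30.0, 40.0)]
preds = [((10.0, 20.0), 0.9), ((34.0, 44.0), 0.8), ((50.0, 60.0), 0.7)]
print(f"average mAP over tIoU 0.3:0.1:0.7 = {average_map(preds, gts, THUMOS14_TIOU):.2f}")
```

A real benchmark run averages AP over all classes before averaging over thresholds. ActivityNet-1.3 results are conventionally averaged over tIoU 0.5 to 0.95 in steps of 0.05, which is again an assumption about the protocol rather than something stated here.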

What carries the argument

SpikeTAD, the end-to-end spiking architecture that adapts SNN conversion techniques to the temporal action detection pipeline to avoid severe performance loss and long time-steps.

If this is right

SNNs become practical for full video understanding pipelines rather than classification alone.
Mobile devices can host action detection models that run on neuromorphic chips with minimal energy draw.
End-to-end TAD no longer requires high-power artificial neural network backbones.
Low-power video models open deployment on future mobile hardware that relies on spiking computation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same adaptation steps could be tested on related tasks such as temporal action segmentation or video captioning to check whether the performance retention generalizes.
Direct measurement of energy use on actual neuromorphic hardware would give a concrete number for the claimed power advantage over ANN baselines.
Hybrid networks that mix spiking layers with a few conventional layers might further improve the accuracy-power trade-off beyond what pure SpikeTAD achieves.
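Until hardware measurements are available, the power advantage discussed above is usually estimated analytically by counting operations. The sketch below shows one such back-of-the-envelope estimate. The per-operation energies are the 45 nm figures commonly quoted from Horowitz (ISSCC 2014, cited in the reference list), and the workload numbers are purely hypothetical. None of this is taken from the SpikeTAD paper's own energy model.

```python
# Per-operation energies in picojoules for 32-bit floating point at 45 nm.
# These are the figures commonly quoted from Horowitz (ISSCC 2014); treat
# them as assumptions, not as values taken from the SpikeTAD paper.
E_MAC_PJ = 4.6  # multiply-accumulate, used by dense ANN layers
E_AC_PJ = 0.9   # accumulate only, used when a binary spike gates a weight


def ann_energy_pj(macs: float) -> float:
    """Energy of one ANN forward pass that performs `macs` multiply-accumulates."""
    return macs * E_MAC_PJ


def snn_energy_pj(synaptic_ops: float, firing_rate: float, time_steps: int) -> float:
    """Energy of an SNN whose synapses are only active when a spike arrives.

    synaptic_ops: connections evaluated per time-step if every neuron fired
    firing_rate:  average fraction of neurons that spike per time-step
    time_steps:   number of simulation steps T
    """
    return synaptic_ops * firing_rate * time_steps * E_AC_PJ


# Hypothetical workload: 1 GMAC network, 10% firing rate, 4 time-steps.
ann = ann_energy_pj(1e9)
snn = snn_energy_pj(1e9, firing_rate=0.10, time_steps=4)
print(f"ANN: {ann / 1e9:.2f} mJ, SNN: {snn / 1e9:.2f} mJ, ratio: {ann / snn:.1f}x")
```

The estimate makes the trade-off visible: the SNN advantage shrinks linearly as the number of time-steps or the firing rate grows, which is why long conversion time-steps matter for the power claim.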

Load-bearing premise

Standard SNN conversion techniques or architectural adaptations can be applied to temporal action detection without large accuracy drops or excessively long conversion time-steps.
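To make the time-step concern concrete, the sketch below simulates a textbook integrate-and-fire (IF) neuron with soft reset and shows how its rate-coded output approaches a clipped ReLU as the number of time-steps grows. This is a generic illustration of ANN-to-SNN conversion error, not SpikeTAD's conversion pipeline. The function name, the half-threshold initial potential, and the input value are assumptions chosen for the example.

```python
def if_neuron_rate(x: float, threshold: float, time_steps: int) -> float:
    """Drive an IF neuron with constant input x for `time_steps` steps.

    Returns threshold * (spike count / time_steps), the rate-coded estimate
    of the clipped ReLU min(max(x, 0), threshold).
    """
    v = 0.5 * threshold  # half-threshold start, a common choice in conversion work
    spikes = 0
    for _ in range(time_steps):
        v += x                # integrate the input current
        if v >= threshold:
            v -= threshold    # soft reset keeps the residual charge
            spikes += 1
    return threshold * spikes / time_steps


# The approximation error shrinks as the number of time-steps grows.
x = 0.37
for T in (4, 32, 256):
    estimate = if_neuron_rate(x, threshold=1.0, time_steps=T)
    print(f"T={T:3d}  estimate={estimate:.4f}  abs error={abs(estimate - x):.4f}")

# Negative inputs never reach the threshold, so the output is zero (ReLU-like).
print(if_neuron_rate(-0.2, threshold=1.0, time_steps=32))
```

For inputs between zero and the threshold, the error of this simple neuron is bounded by roughly threshold / (2T), so halving the error requires doubling the time-steps. That scaling is the reason plain conversion tends to need many time-steps.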

What would settle it

Run the converted SpikeTAD model on THUMOS14; accuracy well below 67 percent mAP combined with conversion time-steps in the thousands would falsify the feasibility claim.

Figures

Figures reproduced from arXiv: 2606.12033 by Limin Wang, Min Yang, Mi Zhou.

**Figure 1.** Comparison between SpikeTAD and other methods. We compared the energy consumption ratio and detection performance of SpikeTAD with other end-to-end TAD methods for THUMOS14 [5]. Due to the excessive power consumption of these ANN methods compared to SpikeTAD, we use log on the power consumption axis to visualize the power consumption gap between different methods.

**Figure 2.** Diagram of the multi-threshold neuron. The multi-threshold neuron receives input from last module and emits up to one spike at each time-step t.

Adjacent paper text (Section 3.1.2, Multi-Threshold Neuron): IF neurons are mostly used to replace ReLU activation from linear or convolution layers in CNNs, but they cannot replace the activation operations like GELU from non-linear operations in Transformers. These operations require interact…
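The Figure 2 text describes a neuron that emits at most one spike per time-step and that must cover activations an IF neuron cannot, such as GELU with its negative outputs. The sketch below is one plausible reading of that idea, not the paper's definition: the threshold set (a base threshold halved at each level, with mirrored negative thresholds), the largest-threshold-first selection rule, and every name in the code are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def multi_threshold_neuron(inputs: Sequence[float],
                           base_threshold: float = 1.0,
                           levels: int = 4) -> Tuple[List[float], float]:
    """Hypothetical multi-threshold neuron.

    At each time-step the neuron integrates its input and emits at most one
    signed spike whose magnitude is the largest threshold not exceeding the
    membrane potential in absolute value. Returns the spike train and the
    residual membrane potential.
    """
    # Thresholds from largest to smallest: theta, theta/2, theta/4, ...
    thresholds = [base_threshold / (2 ** k) for k in range(levels)]
    v = 0.0
    spikes: List[float] = []
    for x in inputs:
        v += x
        emitted = 0.0
        for th in thresholds:
            if v >= th:
                emitted = th      # positive spike at this level
                break
            if v <= -th:
                emitted = -th     # negative spike at this level
                break
        v -= emitted              # soft reset by the emitted amount
        spikes.append(emitted)
    return spikes, v


# A constant input of 0.3 is tracked within a few time-steps.
train, residual = multi_threshold_neuron([0.3] * 8)
print(train, sum(train) / len(train), residual)

# Negative inputs produce negative spikes, which a plain IF neuron cannot emit.
print(multi_threshold_neuron([-0.3] * 8)[0])
```

Because each spike can carry one of several magnitudes, a single time-step conveys more than one bit of information. That is consistent with the page's statement that the design reduces the dependency on long time-steps, though the actual mechanism in SpikeTAD may differ.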

**Figure 3.** Difference between semantic time and computational time.

Adjacent paper text: … residual information lost during quantization, thereby reducing the dependency on extensive time-steps. Simultaneously, the MTN enhances the information capacity per time-step and mitigates fine-grained quantization errors. For the detector stage, where features are compressed into the temporal-only dimension (T), we utilize a simplified Integrate-a…

**Figure 4.** The overview of SpikeTAD. Our SpikeTAD consists of a backbone and detector. The backbone has L ViT blocks. We transfer the input into spikes using multi-threshold neuron layer. We also transfer non-linear modules like GELU, LayerNorm and Softmax (marked in green) into corresponding expectation compensation modules to preserve prior information after each time-step. The detector adopts simple maxpooling la…

**Figure 5.** Error analysis of SpikeTAD. There are error rates of 5 types on top-10G predictions, where G denotes the number of ground truths.

Adjacent paper text: … calculate the confidence interval, with degrees of freedom $df = 4$, a significance level $\alpha = 0.05$, a two-sided critical value $t_{\alpha/2,\,df} = 2.776$, and the confidence interval is $\mathrm{CI} = \mu \pm t_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$. As shown in …
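The confidence-interval formula quoted in the Figure 5 text can be worked through in a few lines. The sketch below assumes five repeated runs (so that the degrees of freedom are 4, matching the text) and uses the sample standard deviation. The sample values are invented for illustration and are not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided critical value for alpha = 0.05 and df = 4, as quoted in the text.
T_CRIT_DF4 = 2.776


def confidence_interval(samples: Sequence[float]) -> Tuple[float, float]:
    """95% confidence interval mu +/- t * sigma / sqrt(n) for exactly 5 samples."""
    n = len(samples)
    if n != 5:
        raise ValueError("the hard-coded critical value is only valid for n = 5 (df = 4)")
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation, n - 1 denominator
    half_width = T_CRIT_DF4 * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width


# Hypothetical metric values from five runs (not data from the paper).
runs = [10.0, 12.0, 11.0, 9.0, 13.0]
low, high = confidence_interval(runs)
print(f"mean = {statistics.mean(runs):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

For a different number of runs the critical value changes, so it should be looked up again rather than reused.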

Original abstract

Video understanding is a crucial part of computer vision, with numerous application scenarios. With the increasing popularity of mobile devices, an increasing number of efforts are trying to deploy video understanding models on them. However, existing video understanding models are difficult to deploy due to their large size and prohibitive power consumption. Spiking Neural Networks (SNNs) have shown bioplausibility and low power advantages over Artificial Neural Networks (ANNs), especially on neuromorphic chips which are regarded as essential components of future mobile devices. However, excessively long conversion time-steps and severe performance degradation problems limit their application. To solve the problems above, we explore the application of SNNs on temporal action detection (TAD), which is an important task in video understanding, and propose the first SNN-based end-to-end TAD architecture coined as SpikeTAD. While maintaining extremely low power consumption, SpikeTAD achieves an average mAP of 67.2% in THUMOS14 and 37.42% in ActivityNet-1.3, demonstrating the feasibility of a low-power TAD model. Our code is available at https://github.com/MCG-NJU/SpikeTAD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SpikeTAD is the first end-to-end SNN for temporal action detection and reports usable mAP on THUMOS14 and ActivityNet while claiming low power.

The key takeaway is that this is the first spiking neural network designed for end-to-end temporal action detection. It reports 67.2% mAP on THUMOS14 and 37.42% on ActivityNet-1.3 while emphasizing low power use.

The paper applies SNNs to TAD and claims to overcome the typical barriers of long conversion time-steps and performance drops. Making the code public helps with checking the implementation.

This is new because prior SNN work has not targeted TAD in this way. The focus on mobile deployment makes sense given the power advantages of SNNs on neuromorphic chips.

The results suggest feasibility, but the abstract leaves out the specific architecture changes or training methods used to achieve the reported performance. Without those, it's hard to assess if the solution is general or tied to particular choices. The mAP values are given without direct ANN comparisons or ablation studies visible in the summary.

Readers working on neuromorphic vision or energy-efficient video analysis would get the most from this. It shows a path for SNNs beyond simple classification tasks.

The work deserves a serious referee. The claim is clear, the benchmarks are standard, and the code release allows for proper evaluation. I would recommend sending it for review rather than desk rejecting it.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes SpikeTAD, presented as the first end-to-end spiking neural network architecture for temporal action detection (TAD). It claims to overcome the barriers of excessively long conversion time-steps and severe performance degradation that have limited SNN application to video understanding, while reporting an average mAP of 67.2% on THUMOS14 and 37.42% on ActivityNet-1.3 together with extremely low power consumption. The abstract states that code is released at a public repository.

Significance. If the reported mAP values are shown to arise from a properly validated end-to-end SNN pipeline that genuinely avoids the usual conversion-time and accuracy penalties, the result would establish the practical feasibility of low-power SNNs for a core video-understanding task and could support deployment on neuromorphic hardware. The work would therefore be of interest to both the neuromorphic-computing and video-analysis communities.

major comments (1)

[Abstract] The feasibility claim rests on the assertion that SpikeTAD solves the long conversion time-step and performance-degradation problems, yet the abstract supplies neither the number of time-steps employed, the conversion method used, nor any direct ANN baseline comparison; without these quantities the central claim cannot be evaluated from the given text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed comment on the abstract. We address it point-by-point below and will revise the manuscript accordingly.

Referee: [Abstract] The feasibility claim rests on the assertion that SpikeTAD solves the long conversion time-step and performance-degradation problems, yet the abstract supplies neither the number of time-steps employed, the conversion method used, nor any direct ANN baseline comparison; without these quantities the central claim cannot be evaluated from the given text.

Authors: We agree that the abstract, as currently written, does not include the specific quantities needed to evaluate the central claim directly from the abstract alone. The manuscript body provides the time-step count, conversion details, and ANN comparisons in the experimental sections, but we acknowledge that these should be summarized in the abstract for clarity. In the revised version we will update the abstract to state the number of time-steps, the conversion method employed, and the direct ANN baseline mAP values.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper is an empirical proposal of an SNN architecture for TAD, reporting benchmark mAP numbers. No equations, derivations, predictions from fitted parameters, or load-bearing self-citations appear in the abstract or described content. No steps reduce by construction to inputs, so the derivation chain (if any) is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No methodological details available from abstract to populate free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

49 extracted references · 9 canonical work pages

[1] C. Zhang, J. Wu, Y. Li, Actionformer: Localizing moments of actions with transformers, in: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part IV, Vol. 13664 of Lecture Notes in Computer Science, Springer, 2022, pp. 492–510. doi:10.1007/978-3-031-19772-7_29

[2] X. Liu, Q. Wang, Y. Hu, X. Tang, S. Zhang, S. Bai, X. Bai, End-to-end temporal action detection with transformer, IEEE Trans. Image Process. 31 (2022) 5427–. doi:10.1109/TIP.2022.3195321

[5] M. Yang, H. Gao, P. Guo, L. Wang, Adapting short-term transformers for action detection in untrimmed videos, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 18570–18579

[6] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, R. Sukthankar, THUMOS challenge: Action recognition with a large number of classes, http://crcv.ucf.edu/THUMOS14/ (2014)

[7] F. Caba Heilbron, V. Escorcia, B. Ghanem, J. Carlos Niebles, Activitynet: A large-scale video benchmark for human activity understanding, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 961–970

[8] X. Chen, Y. Guo, J. Liang, S. Zhuang, R. Zeng, X. Hu, Temporal action detection model compression by progressive block drop, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

[9] W. Maass, Networks of spiking neurons: the third generation of neural network models, Neural networks 10 (9) (1997) 1659–1671

[10] W. Fang, Z. Yu, Y. Chen, T. Huang, T. Masquelier, Y. Tian, Deep residual learning in spiking neural networks, Advances in Neural Information Processing Systems 34 (2021) 21056–21069

[11] Z. Zhou, Y. Zhu, C. He, Y. Wang, S. Yan, Y. Tian, L. Yuan, Spikformer: When spiking neural network meets transformer, in: International Conference on Learning Representations (ICLR), 2023

[12] S. Liu, C.-L. Zhang, C. Zhao, B. Ghanem, End-to-end temporal action detection with 1b parameters across 1000 frames, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 18591–18601

[13] F. Cheng, G. Bertasius, Tallformer: Temporal action localization with a long-memory transformer, in: Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXIV, Vol. 13694 of Lecture Notes in Computer Science, Springer, 2022, pp. 503–521. doi:10.1007/978-3-031-19830-4_29

[14] C. Zhao, S. Liu, K. Mangalam, B. Ghanem, Re2tal: Rewiring pretrained video backbones for reversible temporal action localization, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 10637–10647

[15] M. Yang, G. Chen, Y. Zheng, T. Lu, L. Wang, Basictad: An astounding rgb-only baseline for temporal action detection, Comput. Vis. Image Underst. 232 (2023) 103692. doi:10.1016/J.CVIU.2023.103692

[16] J. Carreira, A. Zisserman, Quo vadis, action recognition? A new model and the kinetics dataset, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society, 2017, pp. 4724–4733. doi:10.1109/CVPR.2017.502

[17] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, L. V. Gool, Temporal segment networks: Towards good practices for deep action recognition, in: Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, Vol. 9912 of Lecture Notes in Computer Science, Springer, 2016, pp. 20–36...

[18] L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, Y. Qiao, Videomae V2: scaling video masked autoencoders with dual masking, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, IEEE, 2023, pp. 14549–14560. doi:10.1109/CVPR52729.2023.01398

[19] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, A. Zisserman, The kinetics human action video dataset, CoRR abs/1705.06950 (2017)

[20] Y. Wang, Y. He, Y. Li, K. Li, J. Yu, X. Ma, X. Li, G. Chen, X. Chen, Y. Wang, P. Luo, Z. Liu, Y. Wang, L. Wang, Y. Qiao, Internvid: A large-scale video-text dataset for multimodal understanding and generation, in: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, OpenReview.net, 2024

[21] Z. Huang, X. Shi, Z. Hao, T. Bu, J. Ding, Z. Yu, T. Huang, Towards high-performance spiking transformers from ann to snn conversion, in: Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 10688–10697

[22] Z. Huang, W. Fang, T. Bu, P. Xue, Z. Hao, W. Liu, Y. Tang, Z. Yu, T. Huang, Differential coding for training-free ann-to-snn conversion, in: International Conference on Machine Learning (ICML), 2025

[23] T. Bu, W. Fang, J. Ding, P. Dai, Z. Yu, T. Huang, Optimal ann-snn conversion for high-accuracy and ultra-low-latency spiking neural networks, in: International Conference on Learning Representations (ICLR), 2022

[24] T. N. Tang, K. Kim, K. Sohn, Temporalmaxer: Maximize temporal context with only max pooling for temporal action localization, arXiv preprint arXiv:2303.09055 (2023)

[25] C. Lin, C. Xu, D. Luo, Y. Wang, Y. Tai, C. Wang, J. Li, F. Huang, Y. Fu, Learning salient boundary feature for anchor-free temporal action localization, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, Computer Vision Foundation/IEEE, 2021, pp. 3320–3329. doi:10.1109/CVPR46437.2021.00333

[26] T. Bu, J. Ding, Z. Yu, T. Huang, Optimized potential initialization for low-latency spiking neural networks, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 36, 2022, pp. 11–20

[27] Z. Hao, J. Ding, T. Bu, T. Huang, Z. Yu, Bridging the gap between anns and snns by calibrating offset spikes, in: International Conference on Learning Representations (ICLR), 2023

[28] B. Han, K. Roy, Deep spiking neural network: Energy efficiency through time based coding, in: European conference on computer vision, Springer, 2020, pp. 388–404

[29] Z. Hao, T. Bu, J. Ding, T. Huang, Z. Yu, Reducing ann-snn conversion error through residual membrane potential, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 37, 2023, pp. 11–21

[30] M. Zhang, J. Wang, J. Wu, A. Belatreche, B. Amornpaisannon, Z. Zhang, V. P. K. Miriyala, H. Qu, Y. Chua, T. E. Carlson, et al., Rectified linear postsynaptic potential function for backpropagation in deep spiking neural networks, IEEE transactions on neural networks and learning systems 33 (5) (2021) 1947–1958

[31] Y. Li, Y. Guo, S. Zhang, S. Deng, Y. Hai, S. Gu, Differentiable spike: Rethinking gradient-descent for training spiking neural networks, Advances in neural information processing systems 34 (2021) 23426–23439

[32] Y. Guo, Y. Chen, L. Zhang, Y. Wang, X. Liu, X. Tong, Y. Ou, X. Huang, Z. Ma, Reducing information loss for spiking neural networks, in: European Conference on Computer Vision, Springer, 2022, pp. 36–52

[33] Y. Guo, Y. Chen, L. Zhang, X. Liu, Y. Wang, X. Huang, Z. Ma, Im-loss: information maximization loss for spiking neural networks, Advances in Neural Information Processing Systems 35 (2022) 156–166

[34] Y. Guo, W. Peng, X. Liu, Y. Chen, Y. Zhang, X. Tong, Z. Jie, Z. Ma, Enof-snn: Training accurate spiking neural networks via enhancing the output feature, Advances in Neural Information Processing Systems 37 (2024) 51708–51726

[35] E. O. Neftci, H. Mostafa, F. Zenke, Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks, IEEE Signal Processing Magazine 36 (6) (2019) 51–63

[36] N. Caporale, Y. Dan, Spike timing–dependent plasticity: a hebbian learning rule, Annu. Rev. Neurosci. 31 (1) (2008) 25–46

[37] C. Li, L. Ma, S. Furber, Quantization framework for fast spiking neural networks, Frontiers in Neuroscience 16 (2022) 918793

[38] T. Lin, P. Goyal, R. B. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, IEEE Computer Society, 2017, pp. 2999–. doi:10.1109/ICCV.2017.324

[40] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, D. Ren, Distance-iou loss: Faster and better learning for bounding box regression, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 34, 2020, pp. 12993–13000

[41] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, et al., A million spiking-neuron integrated circuit with a scalable communication network and interface, Science 345 (6197) (2014) 668–673

[42] M. Horowitz, 1.1 computing’s energy problem (and what we can do about it), in: 2014 IEEE international solid-state circuits conference digest of technical papers (ISSCC), IEEE, 2014, pp. 10–14

[43] G. Chen, P. Peng, G. Li, Y. Tian, Training full spike neural networks via auxiliary accumulation pathway, CoRR abs/2301.11929 (2023)

[44] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019

[45] C. Feichtenhofer, H. Fan, J. Malik, K. He, Slowfast networks for video recognition, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6202–6211

[46] D. Shi, Y. Zhong, Q. Cao, L. Ma, J. Li, D. Tao, Tridet: Temporal action detection with relative boundary modeling, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18857–18866

[47] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, H. Hu, Video swin transformer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 3202–3211

[48] X. Luo, M. Yao, Y. Chou, B. Xu, G. Li, Integer-valued training and spike-driven inference spiking neural network for high-performance and energy-efficient object detection, in: European Conference on Computer Vision, Springer, 2024, pp. 253–272

[49] S. Kim, S. Park, B. Na, S. Yoon, Spiking-yolo: spiking neural network for energy-efficient object detection, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 34, 2020, pp. 11270–11277

[50] H. Alwassel, F. C. Heilbron, V. Escorcia, B. Ghanem, Diagnosing error in temporal action detectors, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 256–272
